Installing CHEEREIO

Prerequisites

Because CHEEREIO wraps GEOS-Chem, your computing environment must have the modules needed to compile and run GEOS-Chem loaded. As with any Earth science model, this task alone can be challenging. See the GEOS-Chem Wiki for the hardware and software requirements. For hardware, multiply the recommended resources by the ensemble size; 32 members is a standard choice. Because the ensemble is handled as a “job array”, this requirement is spread across 32 loosely coordinated jobs, so memory use is distributed over multiple nodes on your compute cluster. Memory demand can still be high, because CHEEREIO needs to load many NetCDF files into memory at once and form large matrices. By adjusting the MaxPar setting in ens_config.json, you can limit the number of columns calculated simultaneously on one job allocation, and thus the memory load.

A sample environment that allows NetCDF libraries to run in both Python and GEOS-Chem is supplied by cheereio.env in the environments/ folder. This environment is designed for the Harvard cluster and will need to be adjusted for other machines.
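
The resource scaling and the MaxPar setting described above can be sketched in Python. This is a minimal illustration, not part of CHEEREIO: the function names and any per-member figures you pass in are hypothetical, and the sketch assumes that MaxPar is a top-level key in ens_config.json. Check both assumptions against your own copy of the file.

```python
import json
from pathlib import Path


def ensemble_resources(cores_per_member, mem_gb_per_member, n_members=32):
    """Scale single-run GEOS-Chem resources by the ensemble size.

    The totals are spread across n_members array jobs, so they describe
    cluster-wide demand rather than a per-node requirement.
    """
    return {
        "total_cores": cores_per_member * n_members,
        "total_mem_gb": mem_gb_per_member * n_members,
    }


def set_max_par(config_path, max_par):
    """Rewrite the MaxPar entry of a JSON config file.

    Assumes MaxPar is a top-level key (an assumption, not confirmed here).
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if "MaxPar" not in config:
        raise KeyError("MaxPar not found at the top level of the config")
    # Keep the type of the existing value (string or number), because the
    # stored type is not something this sketch can know in advance.
    old_value = config["MaxPar"]
    config["MaxPar"] = str(max_par) if isinstance(old_value, str) else int(max_par)
    path.write_text(json.dumps(config, indent=4))
    return config["MaxPar"]
```

Lowering the value passed to set_max_par reduces how many columns are processed at once on one job allocation, trading run time for a smaller memory footprint. Note that the file is rewritten with uniform indentation, so work on a copy if you want to preserve its original layout.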

Beyond the standard GEOS-Chem requirements, CHEEREIO currently requires the SLURM resource manager to handle batch submission, because it relies on SLURM’s support for job arrays. CHEEREIO also requires the following software to be installed: jq for JSON support, GNU Parallel for handling LETKF column-wise updates efficiently, and Anaconda-managed Python with the “cheereio” conda environment (or an equivalent) installed, matching the cheereio.yaml file from the GitHub repository. With all of these loaded in the software environment, CHEEREIO has what it needs to run.
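
Before going further, it can help to confirm that the tools listed above are visible in your current environment. The following is a small sketch using only the Python standard library. The executable names (sbatch for SLURM, jq, parallel for GNU Parallel, conda for Anaconda) are assumptions about how these tools are exposed on your cluster, and the function name is hypothetical.

```python
import shutil

# Executable names are assumptions: sbatch for SLURM, jq, parallel for
# GNU Parallel, and conda for the Anaconda-managed Python environment.
REQUIRED_TOOLS = ("sbatch", "jq", "parallel", "conda")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

A tool reported as missing usually means that the corresponding module has not been loaded in the current shell, not necessarily that it is absent from the cluster.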

Steps to install

CHEEREIO installation should be relatively simple if you have already installed GEOS-Chem version 13.0.0 or later on your machine. Follow the steps below:

  1. Clone the CHEEREIO GitHub repository into a permanent directory on your machine.
    1. Install the conda environment for CHEEREIO from the cheereio.yaml file by following this guide. This file is located in the environments folder.

    2. Depending on the observations you want to use with CHEEREIO, you might have to add a new observation operator. CHEEREIO expects a very specific format for observation operators, which is detailed in the Workflow to add a new observation operator section. If you develop a new observation operator, I strongly encourage you to add it on a new branch in the CHEEREIO git tree and open a pull request against the main repository. This will allow the community to make use of your operator and will speed up new research.

  2. Clone the GCClassic GitHub repository within the CHEEREIO folder and update the submodules. CHEEREIO requires GEOS-Chem version 13.0.0 or later.

  3. Modify the ens_config.json configuration file according to your needs. A considerable amount of scientific thought should go into the modification, as ens_config.json encodes assumptions about what species and emissions your observations will allow you to update. See Configuring your simulation for a detailed guide on how to prepare this file so you can get the best results with CHEEREIO.

  4. Deploy the ensemble, after reading the Configuring your simulation page to understand how this procedure works. Do so by following these steps:
    1. Run setup_ensemble.sh with SetupTemplateRundir set to true and all other main switches set to false. This will create a template run directory, which is almost identical to a standard GEOS-Chem run directory but with some important differences. input.geos, for example, will have empty tags set at key locations that allow CHEEREIO to resubmit GEOS-Chem runs for different time periods. HEMCO_Config.rc is represented by two template files. HEMCO_Config_SPINUP_NATURE_TEMPLATE.rc is for spinup and “nature” simulations, neither of which includes randomized scaling factors. HEMCO_Config.rc is for ensemble members, all of which will have perturbed emissions. References to gridded scaling factors are added at key lines in this config file.

    2. VERIFY THAT HEMCO_Config.rc IS CORRECT. Depending on the simulation you want to run, there are some subtleties that you need to check to ensure that CHEEREIO will work the way you expect it to. For more information, see the Verifying HEMCO Config after initialization page.

    3. Make any additional changes to the template run directory that you would like to see reflected in the ensemble.

    4. Re-run setup_ensemble.sh with SetupTemplateRundir set to false and all other main switches set to true. However, if you are not using a global spinup run (and are supplying your own spun-up restart file), you should set the SetupSpinupRun switch to false. This step will take a few minutes, as it involves compiling GEOS-Chem and copying and modifying large files. If you have custom compile-time settings you wish to use, you can set the CompileTemplateRundir switch to false and compile GEOS-Chem yourself first.

  5. Your ensemble is now built and deployed. If you are using the “separate run” form of ensemble spinup, which is recommended for assimilation of species with longer lifetimes (and is indicated by setting DO_ENS_SPINUP to true in ens_config.json), you can cd to the ensemble_runs folder and run the run_ensspin.sh file to perform ensemble spinup. After this completes, or if DO_ENS_SPINUP is turned off, you can run the run_ens.sh file. I prefer to run both of these shell scripts with the command format nohup bash run_ens.sh &. Once the script is launched, the SLURM job array is submitted. For more information on how to run the ensemble, and on how to set up the two forms of ensemble spinup, see the Running the ensemble page.

  6. While the ensemble is running, you can execute the control run simulation from the control_run folder by submitting the script with the name RUNNAME_Control.run via sbatch. This is equivalent to running GEOS-Chem without assimilation, and is useful for doing postprocessing analyses. The control_run folder is created by setting the SetupControlRun switch to true in setup_ensemble.sh.
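
The launch logic in steps 5 and 6 can be summarized as a short Python sketch that lists the commands without running them. The helper names are hypothetical, and the sketch assumes that DO_ENS_SPINUP is a top-level key in ens_config.json stored either as a JSON boolean or as a string such as "true". RUNNAME_Control.run is kept as the placeholder used in the text and should be replaced with the actual script name in your control_run folder.

```python
def _is_true(value):
    """Interpret a JSON boolean or a string such as "true" as a flag."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def launch_plan(config, control_script="RUNNAME_Control.run"):
    """List (folder, command) pairs for steps 5 and 6 without running them.

    config is the dictionary loaded from ens_config.json. The key name
    DO_ENS_SPINUP is taken from the text; its stored type is an assumption.
    """
    plan = []
    if _is_true(config.get("DO_ENS_SPINUP", False)):
        # Ensemble spinup must complete before run_ens.sh is started.
        plan.append(("ensemble_runs", ["bash", "run_ensspin.sh"]))
    plan.append(("ensemble_runs", ["bash", "run_ens.sh"]))
    # The control run can be submitted while the ensemble is running.
    plan.append(("control_run", ["sbatch", control_script]))
    return plan
```

In practice, config would come from calling json.load on ens_config.json. Each entry pairs the folder to cd into with the command to issue there; the two ensemble scripts can also be wrapped with nohup and run in the background, as described in step 5.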