Installing CHEEREIO

Prerequisites

Because CHEEREIO wraps GEOS-Chem, it requires that your computing environment has the appropriate modules loaded to compile and run GEOS-Chem. Like any model in Earth Science, this task alone can be quite challenging; see the GEOS-Chem Wiki for the hardware and software requirements. In terms of hardware, you should multiply the recommended resources by the ensemble size, for which 32 is a standard choice. Because the ensemble is handled as a "job array", this resource requirement is spread across 32 loosely coordinated jobs; in short, memory can be distributed across multiple nodes on your compute cluster.

RAM usage can be quite intense, as CHEEREIO needs to load many large NetCDF files into memory at once. By adjusting the MaxPar setting in ens_config.json, you can limit the number of columns calculated simultaneously on one job allocation and thus the RAM load. If you are working with TROPOMI, particularly with products that have a large number of observations per day or a high vertical resolution (e.g. NO2), it is recommended that you set LOW_MEMORY_TROPOMI_AVERAGING_KERNEL_CALC to True in ens_config.json. This setting switches the averaging kernel calculation to a slower algorithm that avoids forming large matrices, potentially saving vast amounts of memory.
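For concreteness, these memory-related settings are ordinary entries in ens_config.json. The sketch below shows only the two keys discussed above; the MaxPar value is an illustrative assumption, and the exact value types (strings versus native JSON types) should follow the conventions of your existing ens_config.json:

```json
{
  "MaxPar": "8",
  "LOW_MEMORY_TROPOMI_AVERAGING_KERNEL_CALC": "True"
}
```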

A sample environment that allows NetCDF libraries to run in both Python and GEOS-Chem is supplied by cheereio.env in the environments folder. This environment is designed for the Harvard cluster and will need to be adjusted for other machines.

Beyond the standard GEOS-Chem requirements, CHEEREIO currently requires the SLURM resource manager to handle batch submission, because of SLURM's support for job arrays. CHEEREIO also requires the following software to be installed: (1) jq, for JSON support; (2) GNU parallel, for handling LETKF column-wise updates efficiently; and (3) Anaconda-managed Python with the "cheereio" conda environment, built from the cheereio.yaml file in the Github repository. With all of these modules loaded in the software environment, CHEEREIO should run without a hitch.
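As a sketch, a login shell or job script on a SLURM cluster might prepare this environment as follows. The module names and the environments/ path are assumptions; module names in particular are site-specific, so substitute whatever your cluster provides:

```bash
# Load site-specific modules (names are examples only)
module load jq            # JSON support for CHEEREIO's shell scripts
module load parallel      # GNU parallel for column-wise LETKF updates
module load Anaconda3     # Anaconda-managed Python

# Create (once) and then activate the CHEEREIO conda environment
conda env create -f environments/cheereio.yaml   # defines the "cheereio" environment
conda activate cheereio
```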

Steps to install

CHEEREIO installation should be relatively simple if you have already installed GEOS-Chem version 13.0.0 or later on your machine. Follow the steps below:

  1. Clone the CHEEREIO Github repository into a permanent directory on your machine.

    1. Install a conda environment for CHEEREIO from the cheereio.yaml file by following this guide. This file is given in the environments folder.

    2. If you are not working with TROPOMI or OMI NO2 data, you will likely have to modify the code in the repository to allow CHEEREIO to read in your dataset of choice. I would encourage you to add your custom observation operator on a new branch and make a pull request in the main repository. This will allow the community to make use of your operator and speed up the rate of new research.

  2. Clone the GCClassic Github repository within the CHEEREIO folder and update the submodules. CHEEREIO requires GEOS-Chem version 13.0.0 or later.

  3. Modify the ens_config.json configuration file according to your needs. A considerable amount of scientific thought should go into the modification, as ens_config.json encodes assumptions about what species your observations will allow you to update. See The ensemble configuration file for a detailed guide on how to prepare this file so you can get the best results with CHEEREIO.

  4. Deploy the ensemble, after reading The ensemble configuration file and The setup ensemble script pages to understand how this procedure works. Do so by following these steps:
    1. Run setup_ensemble.sh with SetupTemplateRundir set to true and all other main switches set to false. This will create a template run directory that has some important differences from standard GEOS-Chem run directories. input.geos, for example, will have empty tags set at key locations so that CHEEREIO can resubmit GEOS-Chem runs for different time periods. HEMCO_Config.rc is represented by two template files: HEMCO_Config_SPINUP_NATURE_TEMPLATE.rc is for spinup and "nature" simulations, neither of which includes randomized scaling factors, while HEMCO_Config.rc is for ensemble members, all of which have perturbed emissions. References to gridded scaling factors are added at key lines in this config file.

    2. VERIFY THAT HEMCO_Config.rc IS CORRECT. In particular, while CHEEREIO does support multiple emissions updates within the same species (for example, updating NO agricultural emissions separately from the rest of NO emissions), it cannot distinguish these kinds of emissions categories on its own. Instead, it will add scaling factor references wherever the species of interest appears. In this case, the user must delete the duplicated scaling factor references that do not apply. The user should also take special care that scaling factors are not applied to inapplicable sources, such as negative emissions from soil uptake. See the GEOS-Chem Wiki page for HEMCO_Config.rc for more information.

    3. Make any additional changes to the template directory that should be reflected in the ensemble.

    4. Re-run setup_ensemble.sh with SetupTemplateRundir set to false and all other main switches set to true. However, if you are not using a Spinup run, you should set the SetupSpinupRun switch to false. This will take a while, as it involves compiling GEOS-Chem and copying and modifying large files. You can set the CompileTemplateRundir switch to false and compile yourself first if you have custom compile-time settings you wish to invoke.

  5. Your ensemble is now built and deployed. If you are using the “separate run” form of ensemble spinup, common for pre-13.4 GEOS-Chem (and indicated by setting DO_ENS_SPINUP to true in ens_config.json), you can cd to the ensemble_runs folder and execute the run_ensspin.sh file to execute ensemble spinup. After this completes, or if DO_ENS_SPINUP is turned off, you can execute the run_ens.sh file. I prefer to run both of these shell scripts with the command format nohup bash run_ens.sh &. The SLURM job array is now submitted.
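Taken together, the installation and deployment steps above can be sketched as the following shell session. This is an illustrative outline, not a definitive recipe: the repository URLs are assumptions, the setup_ensemble.sh switches (SetupTemplateRundir, SetupSpinupRun, CompileTemplateRundir, and the rest) must be set in your configuration between runs, and ens_config.json and HEMCO_Config.rc must be edited and verified by hand as described above:

```bash
# 1. Clone CHEEREIO into a permanent directory (URL is an assumption)
git clone https://github.com/drewpendergrass/CHEEREIO.git
cd CHEEREIO

# 2. Clone GCClassic (version 13.0.0 or later) inside the CHEEREIO folder
#    and update its submodules
git clone https://github.com/geoschem/GCClassic.git
cd GCClassic && git submodule update --init --recursive && cd ..

# 3. Edit ens_config.json, then build the template run directory
#    (SetupTemplateRundir true, all other main switches false)
bash setup_ensemble.sh

# 4. After verifying HEMCO_Config.rc in the template run directory, flip
#    the switches (SetupTemplateRundir false, the others true) and rerun
bash setup_ensemble.sh

# 5. Launch ensemble spinup if DO_ENS_SPINUP is true, then the ensemble
cd ensemble_runs
nohup bash run_ensspin.sh &   # only for the "separate run" spinup form
nohup bash run_ens.sh &       # submits the SLURM job array
```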