The postprocessing workflow

Once the CHEEREIO ensemble is complete, we need to postprocess the output into a useable form to analyze. On this page we describe how the postprocessing workflow works, and how to use the postprocessing functions to do additional customized analysis.

How to run the default postprocessing workflow

The postprocessing workflow works automatically by using settings in ens_config.json, and can be run by navigating to the postprocess folder in the CHEEREIO code directory and executing postprocess.batch with an sbatch command (i.e. sbatch postprocess.batch). Modify the script first for your cluster and memory requirements.

Another default workflow that users can use is in snapshot.sh. This script can be executed at any time during CHEEREIO run time to make a few plots that give an idea of CHEEREIO run status. See Monitoring the run progress for details.

The postprocessed figures

If the postprocessing workflow runs successfully, then the following figures and movies will be present in the ensemble postprocess/ folder.

actual_observations and simulated_observations movies

These movies show actual observations and simulated observations on the GEOS-Chem grid, and their evolution over time. One of each of these movies is made for every observed species.

Attention

There are known issues with the colorbar in these movies, and will be debugged soon (DP: 2/3/2023).

Observation counts movies (total_aggregated_observation_count and total_raw_observation_counts)

These movies show the total counts of observations (both raw and aggregated to the GEOS-Chem grid) and their evolution over time. One of each of these movies is made for every observed species.

Mean observation and observation counts plots (weighted_mean_obs, total_obs_count)

Maps showing the average observations in the postprocessing period on the GEOS-Chem grid, as well as the number of observations averaged to make the map. One of each of these plots is made for each observation type.

assim_minus_obs and ctrl_minus_obs plots

The assim_minus_obs plots show the difference between the ensemble mean simulated observations and the mean observations over the entire postprocessing period, plotted as a map; one is generated per observed species. The ctrl_minus_obs show the same, but the difference is between the control simulation and observations.

The mean bias and standard deviation of these plots is stored in the stats_out.txt file in the postprocess folder of the CHEEREIO code directory.

Note

This is a key figure output from postprocessing; it indicates if the assimilation has successfully reduced bias relative to observations.

Scalefactor map plots (various time slices)

These plots are titled {SPECIES}_{DATE}_scalefactor.png and they show scaling factor plots saved at each time step, as generated by the plotScaleFactor() function. Users can generate plots at monthly resolution by setting scalefactor_plot_freq to monthly.

Scalefactor map movies

These movies show maps of scale factors evolving over time, one set for each of the emissions assimilated by CHEEREIO. Ensemble maximum, mean, minimum, range, and standard deviation are all generated and plotted. For global simulations, we also show several regions whose files are named with a corresponding suffix: Australia, CONUS, EastChina, Europe, India, South America, and Southern Africa. The global domain has no suffix.

Note

This is a key figure output from postprocessing; it indicates if emissions scaling factors follow sensible spatial patterns, have a reasonable order of magnitude, and change by reasonable amounts from assimilation period to assimilation perid.

SpeciesConc map movies

These movies show maps of surface concentrations evolving over time, one set for each of the concentrations in the CHEEREIO control vector. Ensemble maximum, mean, minimum, range, and standard deviation are all generated and plotted. For global simulations, we also show several regions whose files are named with a corresponding suffix: Australia, CONUS, EastChina, Europe, India, South America, and Southern Africa. The global domain has no suffix.

HEMCO diagnostic map plots (various time slices)

These plots are titled {HEMCO_Diag}_{DATE}_control.png, {HEMCO_Diag}_{DATE}_ensemble_mean.png, and {HEMCO_Diag}_{DATE}_ensemble_std.png. They show emissions plots (from HEMCO Diag) saved at each time step, as generated by the plotEmissions() function. The {HEMCO_Diag}_{DATE}_control.png plot shows the map of the HEMCO diagnostic for the control run (no assimilation); the {HEMCO_Diag}_{DATE}_ensemble_mean.png plot shows the map of the HEMCO diagnostic for the ensemble mean; and the {HEMCO_Diag}_{DATE}_ensemble_mean.png plot shows the map of the HEMCO diagnostic for the ensemble standard deviation. Users can generate plots at monthly resolution by setting scalefactor_plot_freq to monthly.

Observation ts compare plots

These plots are titled observations_ts_compare_{OBSERVER_KEY}.png and observations_ts_compare_{OBSERVER_KEY}_w_control.png, one each for each observed species. These plots show a timeseries depicting (1) the mean observation averaged across the entire spatial domain, (2) the ensemble mean simulated observation averaged across the entire spatial domain (with standard deviation plotted with a dotted line), and (3) the control simulated observation averaged across the spatial domain (only for the _w_control plots).

Note

This is a key figure output from postprocessing; it indicates if the assimilation has successfully tracked observations.

Surface mean timeseries

These plots, one generated for each species in the control vector (CONTROL_VECTOR_CONC within ens_config.json), show the mean surface concentration value over time (with the ensemble standard deviation showed with a dotted line). The files are named with pattern surfmean_ts_{SPECIES}.png.

Emissions timeseries

These plots, one generated for each diagnostic in the HEMCO Diagnostic plot list (hemco_diags_to_process within ens_config.json), show the post-assimilation total emissions against the prior total emissions over time. The files are named with pattern timeseries_totalemissions_{HEMCO_Diag}_against_prior.png.

The postprocessed data files

After the default postprocessing run, the following data files will be present in the ensemble postprocess/ folder.

bigy_arrays_for_plotting.pkl

A dictionary containing gridded observations and simulated observations. Output by the makeBigYArrays() function, which contains more details.

bigY.pkl

A dictionary containing observations and simulated observations as arrays. Output by the makeYEachAssimPeriod() function, which contains more details.

Scalefactor .nc files

NetCDF files containing aggregated scaling factors from across the ensemble; ensemble number is stored in a new dimension Ensemble. Output by the combineScaleFactors() function, which contains more details.

combined_HEMCO_diagnostics.nc file

NetCDF files containing aggregated emissions from across the ensemble; ensemble number is stored in a new dimension Ensemble. Output by the combineHemcoDiag() function, which contains more details.

control_HEMCO_diagnostics.nc file

NetCDF files containing aggregated emissions from the control run. Output by the combineHemcoDiagControl() function, which contains more details.

controlvar_pp.nc file

NetCDF file containing the ensemble-wide surface species concentrations for species in the control vector (CONTROL_VECTOR_CONC within ens_config.json). Output by the makeDatasetForEnsemble() function, which contains more details.

The postprocessing API

Here we list the postprocessing functions, which are stored in the postprocess_tools.py and map_tools.py files in the postprocessing folder. These functions are called in the default postprocessing workflow, and can also be used

Functions in postprocess_tools

Here is the documentation for the postprocessing toolkit, present in the postprocess_tools.py file in the postprocessing folder.

globDirs(ensemble_dir, removeNature=False, includeOutputDir=False)

For a given ensemble directory, get all of the ensemble member run directory paths, their directory names, and their numeric labels and returns them in sorted order.

Parameters
  • ensemble_dir (str) – Path to ensemble run directory.

  • removeNature (bool) – True or False, should we remove ensemble run directory 0 (control run).

  • includeOutputDir (bool) – True or False, write paths to go to OutputDir (where the GEOS-Chem model history is stored) or the individual top level ensemble run directory.

Returns

List of (1) list of paths to ensemble run directory members; (2) list of ensemble run directory names; and (3) list of numeric directory labels.

Return type

list

globSubDir(hist_dir, timeperiod=None, hourlysub=6)

For a given GEOS-Chem output directory, get all the SpeciesConc files in order for a given time period and return the filenames as a list.

Parameters
  • hist_dir (str) – Path to GEOS-Chem output directory.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • hourlysub (int) – Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours.

Returns

List of filenames for SpeciesConc files.

Return type

list

globSubDirLevelEdge(hist_dir, timeperiod=None, hourlysub=6)

As with globSubDir(), but for LevelEdgeDiag files.

Parameters
  • hist_dir (str) – Path to GEOS-Chem output directory.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • hourlysub (int) – Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours.

Returns

List of filenames for LevelEdgeDiag files.

Return type

list

combineScaleFactors(ensemble_dir, output_dir, timeperiod=None, flag_snapshot=False, return_not_write=False)

Combine emissions scaling factors from across the ensemble and save (or return) them as a single NetCDF or xarray DataSet, with a new dimension called “Ensemble” representing ensemble number. One dataset is saved or returned for each scale factor type.

Parameters
  • ensemble_dir (str) – Path to CHEEREIO ensemble directory.

  • output_dir (str) – Path to where the combined scaling factor NetCDF should be saved.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • flag_snapshot (bool) – Flag the output file as a snapshot (True only by the CHEEREIO snapshot script).

  • return_not_write (bool) – Return the combined dataset rather than writing it as a NetCDF file.

Returns

If return_not_write is True, a dictionary containing the scale factor names as keys and xarray DataSets with the combined scaling factors as values.

Return type

dict

combineHemcoDiag(ensemble_dir, output_dir, timeperiod=None)

Combine HEMCO Diagnostics (e.g. emissions) from across the ensemble and save them as a single NetCDF, with a new dimension called “Ensemble” representing ensemble number.

Parameters
  • ensemble_dir (str) – Path to CHEEREIO ensemble directory.

  • output_dir (str) – Path to where the combined HEMCO diagnostic NetCDF should be saved.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

combineHemcoDiagControl(ensemble_dir, output_dir, timeperiod=None)

Combine HEMCO Diagnostics (e.g. emissions) from the control run as a single NetCDF.

Parameters
  • ensemble_dir (str) – Path to CHEEREIO ensemble directory.

  • output_dir (str) – Path to where the combined HEMCO diagnostic NetCDF should be saved.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

makeDatasetForDirectory(hist_dir, species_names, timeperiod=None, hourlysub=6, subset_rule='SURFACE', fullpath_output_name=None)

Combine GEOS-Chem species concentration output from a single ensemble member as a single dataset and either save to a NetCDF file or return.

Parameters
  • hist_dir (str) – Path to GEOS-Chem output directory.

  • species_names (list) – List of species that we would like to process.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • hourlysub (int) – Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours.

  • subset_rule (str) – Which vertical level(s) to save data from. SURFACE is the surface, 850 is the 850hPa level, and ALL is all vertical data.

  • fullpath_output_name (str) – Path and filename of the NetCDF file to which we should save the combined data. If None, return the data instead

Returns

If fullpath_output_name is None, an xarray DataSet with the combined concentrations.

Return type

DataSet

makeDatasetForEnsemble(ensemble_dir, species_names, timeperiod=None, hourlysub=6, subset_rule='SURFACE', fullpath_output_name=None)

Combine GEOS-Chem species concentration output from across the ensemble as a single dataset, with a new Ensemble dimension denoting ensemble member number, and either save to a NetCDF file or return.

Parameters
  • ensemble_dir (str) – Path to CHEEREIO ensemble directory.

  • species_names (list) – List of species that we would like to process.

  • timeperiod (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • hourlysub (int) – Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours.

  • subset_rule (str) – Which vertical level(s) to save data from. SURFACE is the surface, 850 is the 850hPa level, and ALL is all vertical data.

  • fullpath_output_name (str) – Path and filename of the NetCDF file to which we should save the combined data. If None, return the data instead

Returns

If fullpath_output_name is None, an xarray DataSet with the combined concentrations.

Return type

DataSet

makeYEachAssimPeriod(path_to_bigy_subsets, startdate=None, enddate=None, fullpath_output_name=None)

Combine the intermediate Y datasets, as output by The HIST Ensemble class, into a dictionary for the entire ensemble run. These Y datasets contain simulated observations from GEOS-Chem aligned with actual observations, along with other metadata. Returns a combined dictionary, where the keys are timestamps.

Parameters
  • path_to_bigy_subsets (str) – Path to where the intermediate values of Y are saved out by the ensemble (postprocess/bigy in your ensemble installation).

  • startdate (datetime) – Start date for when we should process Y datasets.

  • enddate (datetime) – End date for when we should process Y datasets.

  • fullpath_output_name (str) – Path and filename of the pickle file to which we should save the combined Y dataset. If None, return the data instead.

Returns

If fullpath_output_name is None, an dictionary with the combined Y datasets. The keys are timestamps and the values are the bigY data dictionary for the given day – entries in these include ‘Latitude’, ‘Longitude’, ‘Observations’, and the species name. See The HIST Ensemble class for details.

Return type

dict

plotSurfaceCell(ds, species_name, latind, lonind, outfile=None, unit='ppt', includesNature=False)

Plot a timeseries of the surface concentrations for a single grid cell.

Parameters
  • ds (DataSet) – DataSet of the combined ensemble species concentrations, output by makeDatasetForEnsemble().

  • species_name (list) – Species to plot.

  • latind (int) – Index of latitude of cell we will plot.

  • lonind (int) – Index of longitude of cell we will plot.

  • outfile (str) – Name of image file containing plot. If None, display the plot instead.

  • unit (str) – either ‘ppm’, ‘ppb’, or ‘ppt’; CHEEREIO will multiply the GEOS-Chem mole-mole ratio to get these units.

  • includesNature (bool) – True or False, include the no-assimilation control run in plot.

plotSurfaceMean(ds, species_name, outfile=None, unit='ppt', includesNature=False)

Plot a timeseries of the average surface concentrations across the domain.

Parameters
  • ds (DataSet) – DataSet of the combined ensemble species concentrations, output by makeDatasetForEnsemble().

  • species_name (list) – Species to plot.

  • outfile (str) – Name of image file containing plot. If None, display the plot instead.

  • unit (str) – either ‘ppm’, ‘ppb’, or ‘ppt’; CHEEREIO will multiply the GEOS-Chem mole-mole ratio to get these units.

  • includesNature (bool) – True or False, include the no-assimilation control run in plot.

tsPlotTotalEmissions(ds_ensemble, ds_prior, collectionName, useLognormal=False, timeslice=None, outfile=None)

Plot a timeseries of a HEMCO diagnostic of interest and compare it to control (no assimilation).

Parameters
  • ds_ensemble (DataSet) – DataSet of the combined ensemble HEMCO diagnostics, output by combineHemcoDiag().

  • ds_prior (DataSet) – DataSet of the control HEMCO diagnostics, output by combineHemcoDiagControl().

  • collectionName (list) – Name of HEMCO diagnostic collection to plot.

  • useLognormal (bool) – True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by lognormalErrors in ens_config.json.

  • timeslice (list) – A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period.

  • outfile (str) – Name of image file containing plot. If None, display the plot instead.

tsPlotSatCompare(bigY, species, numens, unit='ppb', observer_name='Observations', useControl=False, outfile=None)

Plot a timeseries of simulated observations from the ensemble against the true observations, along with the control (no assimilation) simulation if requested.

Parameters
  • bigY (dict) – Dictionary containing the Y dictionaries across the entire ensemble. Output from makeYEachAssimPeriod().

  • species (str) – Species to plot

  • numens (int) – Number of ensemble members.

  • unit (str) – Unit to use on the plot label; usually stored in the OBSERVATION_UNITS entry in ens_config.json.

  • observer_name (str) – Name of the observer to use on the plot legend; usually stored in the OBS_TYPE entry in ens_config.json.

  • useControl (bool) – True or False, should we include the control simulation as a separate line on the plot?

  • outfile (str) – Name of image file containing plot. If None, display the plot instead.

tsPlot(time, ensmean, enssd, species_name, unit, nature=None, priortime=None, prior=None, outfile=None)

Utility method to plot a timeseries of ensemble data, with dotted standard deviation, along with control the (no assimilation) simulation data if requested.

Parameters
  • time (array) – Array of time values to plot on the x axis.

  • ensmean (array) – Array of the ensemble mean value of the data of interest.

  • enssd (array) – Array of the ensemble standard deviation value of the data of interest.

  • species_name (str) – Name of species for plot title

  • unit (str) – Unit to use on the plot label.

  • nature (array) – Optional array of values, matching ensmean in time, which will be labeled ‘Nature’ on the plot.

  • priortime (array) – Optional array of time values for the prior (no assimilation), which will be labeled ‘Prior’ on the plot.

  • prior (array) – Optional array of values for the prior, corresponding with priortime in time, which will be labeled ‘Prior’ on the plot.

  • outfile (str) – Name of image file containing plot. If None, display the plot instead.

makeBigYArrays(bigy, gclat, gclon, nEnsemble, postprocess_save_albedo=False, useControl=False)

Take the aggregated big Y data dictionary, as output by makeYEachAssimPeriod(), and processes it into a dictionary containing gridded NumPy arrays. The gridded NumPy arrays are used for plotting.

Parameters
  • bigY (dict) – Dictionary containing the Y dictionaries across the entire ensemble. Output from makeYEachAssimPeriod().

  • gclat (array) – Latitude from GEOS-Chem grid.

  • gclon (array) – Longitude from GEOS-Chem grid.

  • nEnsemble (int) – Number of ensemble members.

  • postprocess_save_albedo (bool) – True or False, should we process albedo data from TROPOMI CH4?

  • useControl (bool) – True or False, should we process simulated observation data from the control run?

Returns

A dictionary with keys “obscount”, containing the total number of observations; “obscount_avg”, containing the total number of observations after averaging to the GEOS-Chem grid; “obs”, containing actual observations; “sim_obs”, containing simulated observations from the GEOS-Chem ensemble mean; “control”, containing simulated observations from the GEOS-Chem control run if requested; and a list of dates under “dates” and species under “species”. The NumPy arrays have dimension (date, species, lat, lon), indexed by the “dates” dictionary entry, “species” dictionary entry, gclat, and gclon arrays respectively.

Return type

dict

Functions in map_tools

Here is the documentation for the postprocessing mapmaking toolkit, present in the map_tools.py file in the postprocessing folder.

plotMap(m, lat, lon, flat, labelname, outfile, clim=None, cmap=None, useLog=False, minval=None)

Utility method to plot a 2D dataset on a map.

Parameters
  • m (Basemap) – Basemap object for plotting.

  • lat (array) – Latitude values.

  • lon (array) – Longitude values.

  • flat (array) – 2D array of data to plot.

  • labelname (str) – Name of label for color bar.

  • outfile (str) – Name of image file containing plot.

  • clim (list) – List of the minimum and maximum values for the colorbar. If None, use the maximum and minimum values of the dataset.

  • cmap (colormap) – Colormap to use for plot; defaults to jet if none supplied.

  • useLog (bool) – True or False, use a log scale for the colorbar.

  • minval (float) – Minimum value to include in the plot; below this value set to nan. No minimum if None.

plotEmissions(m, lat, lon, ppdir, hemco_diags_to_process, plotWithLogScale=True, min_emis=None, min_emis_std=None, plotcontrol=True, useLognormal=False, aggToMonthly=True)

Plot emissions, listed in hemco_diags_to_process, on a map and write to file. Also plot ensemble standard deviations and control emissions.

Parameters
  • m (Basemap) – Basemap object for plotting.

  • lat (array) – Latitude values.

  • lon (array) – Longitude values.

  • ppdir (str) – Path to postprocess directory. This function will read in the data files output from combineHemcoDiag() and combineHemcoDiagControl() and save the files here.

  • hemco_diags_to_process (list) – List of hemco diagnostics to plot. Usually given by hemco_diags_to_process in ens_config.json.

  • plotWithLogScale (bool) – True or False, use a log scale for the colorbar.

  • min_emis (float) – Minimum emissions value to include in the plot; below this value set to nan. No minimum if None.

  • min_emis_std (float) – Minimum emissions standard deviation value to include in the plots of emissions standard deviations; below this value set to nan. No minimum if None.

  • plotcontrol (bool) – True or False, should we plot the control emission maps?

  • useLognormal (bool) – True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by lognormalErrors in ens_config.json.

  • aggToMonthly (bool) – True or False, should we average emissions data to monthly resolution when we plot? Uses agg_to_monthly() to aggregate.

plotScaleFactor(m, lat, lon, ppdir, useLognormal=False, aggToMonthly=True)

Plot emissions scaling factors on a map and write to file.

Parameters
  • m (Basemap) – Basemap object for plotting.

  • lat (array) – Latitude values.

  • lon (array) – Longitude values.

  • ppdir (str) – Path to postprocess directory. This function will read in the data files output from combineScaleFactors() and save the files here.

  • useLognormal (bool) – True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by lognormalErrors in ens_config.json.

  • aggToMonthly (bool) – True or False, should we average emissions data to monthly resolution when we plot? Uses agg_to_monthly() to aggregate.

agg_to_monthly(dates, to_agg)

Aggregate a NumPy array to monthly resolution and return.

Parameters
  • dates (array) – An array of dates, which are used to aggregate to monthly resolution.

  • to_agg (array) – A three or four dimensional array to aggregate to monthly resolution. If three-dimensional, we assume time is the 1st entry; if four-dimensional, we assume time is the second entry.

Returns

A list with (1) the monthly dates, and (2) the aggregated array now at monthly resolution.

Return type

list

Adding a new observation field to the postprocessing workflow

For many observations, there are quantities other than the observation value itself that are relevant in the data assimilation problem. For example, albedo can impact remote sensing retrievals; users might want to automatically plot/store albedo values to see if they correlate with values of concern.

Fortunately for you, CHEEREIO can handle any of these fields automatically without any user code! In the course of developing your Observation Operator, you might stored fields in an ObsData object other than observations (see The ObsData class for a description of this class). For example, the ObsPack operator stores fields such as altitude. These data are added to ObsData objects using the addData function with custom keys. Here is a relevant line of code from the ObsPack operator:

toreturn.addData(utc_conv=ObsPack['utc_conv'],altitude=ObsPack['altitude'],pressure=GC['pressure'].values,obspack_id=ObsPack['obspack_id'],platform=ObsPack['platform'],site_code=ObsPack['site_code'])

Any of these fields are automatically available to the postprocessing workflow. No code is required. The postprocessing routine will automatically try and plot any of these fields that the user lists under EXTRA_OBSDATA_FIELDS_TO_SAVE_TO_BIG_Y and EXTRA_OBSDATA_FIELDS_TO_REGRID_AND_PLOT in the ens_config.json file. See the Postprocessing settings page for details on how these settings work. Again, there is no code required on your part!