.. _Postprocessing workflow: The postprocessing workflow ========== Once the CHEEREIO ensemble is complete, we need to postprocess the output into a useable form to analyze. On this page we describe how the postprocessing workflow works, and how to use the postprocessing functions to do additional customized analysis. How to run the default postprocessing workflow ------------- The postprocessing workflow works automatically by using settings in ``ens_config.json``, and can be run by navigating to the ``postprocess`` folder in the CHEEREIO code directory and executing ``postprocess.batch`` with an sbatch command (i.e. ``sbatch postprocess.batch``). Modify the script first for your cluster and memory requirements. Another default workflow that users can use is in ``snapshot.sh``. This script can be executed at any time during CHEEREIO run time to make a few plots that give an idea of CHEEREIO run status. See :ref:`Monitoring run` for details. The postprocessed figures ~~~~~~~~~~~~~ If the postprocessing workflow runs successfully, then the following figures and movies will be present in the ensemble ``postprocess/`` folder. .. option:: actual_observations and simulated_observations movies These movies show actual observations and simulated observations on the GEOS-Chem grid, and their evolution over time. One of each of these movies is made for every observed species. .. attention:: There are known issues with the colorbar in these movies, and will be debugged soon. .. option:: Observation counts movies (total_aggregated_observation_count and total_raw_observation_counts) These movies show the total counts of observations (both raw and aggregated to the GEOS-Chem grid) and their evolution over time. One of each of these movies is made for every observed species. .. option:: Mean observation and observation counts plots (weighted_mean_obs, total_obs_count) Maps showing the average observations in the postprocessing period on the GEOS-Chem grid, as well as the number of observations averaged to make the map. One of each of these plots is made for each observation type. .. option:: assim_minus_obs and ctrl_minus_obs plots The assim_minus_obs plots show the difference between the ensemble mean simulated observations and the mean observations over the entire postprocessing period, plotted as a map; one is generated per observed species. The ctrl_minus_obs show the same, but the difference is between the control simulation and observations. The mean bias and standard deviation of these plots is stored in the ``stats_out.txt`` file in the ``postprocess`` folder of the CHEEREIO code directory. .. note:: This is a key figure output from postprocessing; it indicates if the assimilation has successfully reduced bias relative to observations. .. option:: Scalefactor map plots (various time slices) These plots are titled ``{SPECIES}_{DATE}_scalefactor.png`` and they show scaling factor plots saved at each time step, as generated by the :py:func:`plotScaleFactor` function. Users can generate plots at monthly resolution by setting ``scalefactor_plot_freq`` to ``monthly``. .. option:: Scalefactor map movies These movies show maps of scale factors evolving over time, one set for each of the emissions assimilated by CHEEREIO. Ensemble maximum, mean, minimum, range, and standard deviation are all generated and plotted. For global simulations, we also show several regions whose files are named with a corresponding suffix: Australia, CONUS, EastChina, Europe, India, South America, and Southern Africa. The global domain has no suffix. .. note:: This is a key figure output from postprocessing; it indicates if emissions scaling factors follow sensible spatial patterns, have a reasonable order of magnitude, and change by reasonable amounts from assimilation period to assimilation perid. .. option:: SpeciesConc map movies These movies show maps of surface concentrations evolving over time, one set for each of the concentrations in the CHEEREIO control vector. Ensemble maximum, mean, minimum, range, and standard deviation are all generated and plotted. For global simulations, we also show several regions whose files are named with a corresponding suffix: Australia, CONUS, EastChina, Europe, India, South America, and Southern Africa. The global domain has no suffix. .. option:: HEMCO diagnostic map plots (various time slices) These plots are titled ``{HEMCO_Diag}_{DATE}_control.png``, ``{HEMCO_Diag}_{DATE}_ensemble_mean.png``, and ``{HEMCO_Diag}_{DATE}_ensemble_std.png``. They show emissions plots (from HEMCO Diag) saved at each time step, as generated by the :py:func:`plotEmissions` function. The ``{HEMCO_Diag}_{DATE}_control.png`` plot shows the map of the HEMCO diagnostic for the control run (no assimilation); the ``{HEMCO_Diag}_{DATE}_ensemble_mean.png`` plot shows the map of the HEMCO diagnostic for the ensemble mean; and the ``{HEMCO_Diag}_{DATE}_ensemble_mean.png`` plot shows the map of the HEMCO diagnostic for the ensemble standard deviation. Users can generate plots at monthly resolution by setting ``scalefactor_plot_freq`` to ``monthly``. .. option:: Observation ts compare plots These plots are titled ``observations_ts_compare_{OBSERVER_KEY}.png`` and ``observations_ts_compare_{OBSERVER_KEY}_w_control.png``, one each for each observed species. These plots show a timeseries depicting (1) the mean observation averaged across the entire spatial domain, (2) the ensemble mean simulated observation averaged across the entire spatial domain (with standard deviation plotted with a dotted line), and (3) the control simulated observation averaged across the spatial domain (only for the ``_w_control`` plots). .. note:: This is a key figure output from postprocessing; it indicates if the assimilation has successfully tracked observations. .. option:: Surface mean timeseries These plots, one generated for each species in the control vector (``CONTROL_VECTOR_CONC`` within ``ens_config.json``), show the mean surface concentration value over time (with the ensemble standard deviation showed with a dotted line). The files are named with pattern ``surfmean_ts_{SPECIES}.png``. .. option:: Emissions timeseries These plots, one generated for each diagnostic in the HEMCO Diagnostic plot list (``hemco_diags_to_process`` within ``ens_config.json``), show the post-assimilation total emissions against the prior total emissions over time. The files are named with pattern ``timeseries_totalemissions_{HEMCO_Diag}_against_prior.png``. .. _postprocessed files: The postprocessed data files ~~~~~~~~~~~~~ After the default postprocessing run, the following data files will be present in the ensemble ``postprocess/`` folder. .. option:: bigy_arrays_for_plotting.pkl A dictionary containing gridded observations and simulated observations. Output by the :py:func:`makeBigYArrays` function, which contains more details. .. option:: bigY.pkl A dictionary containing observations and simulated observations as arrays. Output by the :py:func:`makeYEachAssimPeriod` function, which contains more details. .. option:: Scalefactor .nc files NetCDF files containing aggregated scaling factors from across the ensemble; ensemble number is stored in a new dimension ``Ensemble``. Output by the :py:func:`combineScaleFactors` function, which contains more details. .. option:: combined_HEMCO_diagnostics.nc file NetCDF files containing aggregated emissions from across the ensemble; ensemble number is stored in a new dimension ``Ensemble``. Output by the :py:func:`combineHemcoDiag` function, which contains more details. .. option:: control_HEMCO_diagnostics.nc file NetCDF files containing aggregated emissions from the control run. Output by the :py:func:`combineHemcoDiagControl` function, which contains more details. .. option:: controlvar_pp.nc file NetCDF file containing the ensemble-wide surface species concentrations for species in the control vector (``CONTROL_VECTOR_CONC`` within ``ens_config.json``). Output by the :py:func:`makeDatasetForEnsemble` function, which contains more details. The postprocessing API ------------- Here we list the postprocessing functions, which are stored in the ``postprocess_tools.py`` and ``map_tools.py`` files in the ``postprocessing`` folder. These functions are called in the default postprocessing workflow, and can also be used Functions in postprocess_tools ~~~~~~~~~~~~~ Here is the documentation for the postprocessing toolkit, present in the ``postprocess_tools.py`` file in the ``postprocessing`` folder. .. py:function:: globDirs(ensemble_dir,removeNature=False,includeOutputDir=False) For a given ensemble directory, get all of the ensemble member run directory paths, their directory names, and their numeric labels and returns them in sorted order. :param str ensemble_dir: Path to ensemble run directory. :param bool removeNature: True or False, should we remove ensemble run directory 0 (control run). :param bool includeOutputDir: True or False, write paths to go to OutputDir (where the GEOS-Chem model history is stored) or the individual top level ensemble run directory. :return: List of (1) list of paths to ensemble run directory members; (2) list of ensemble run directory names; and (3) list of numeric directory labels. :rtype: list .. py:function:: globSubDir(hist_dir,timeperiod=None,hourlysub = 6) For a given GEOS-Chem output directory, get all the SpeciesConc files in order for a given time period and return the filenames as a list. :param str hist_dir: Path to GEOS-Chem output directory. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param int hourlysub: Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours. :return: List of filenames for SpeciesConc files. :rtype: list .. py:function:: globSubDirLevelEdge(hist_dir,timeperiod=None,hourlysub = 6) As with :py:func:`globSubDir`, but for LevelEdgeDiag files. :param str hist_dir: Path to GEOS-Chem output directory. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param int hourlysub: Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours. :return: List of filenames for LevelEdgeDiag files. :rtype: list .. py:function:: combineScaleFactors(ensemble_dir,output_dir,timeperiod=None,flag_snapshot=False,return_not_write=False) Combine emissions scaling factors from across the ensemble and save (or return) them as a single NetCDF or xarray DataSet, with a new dimension called "Ensemble" representing ensemble number. One dataset is saved or returned for each scale factor type. :param str ensemble_dir: Path to CHEEREIO ensemble directory. :param str output_dir: Path to where the combined scaling factor NetCDF should be saved. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param bool flag_snapshot: Flag the output file as a snapshot (True only by the CHEEREIO snapshot script). :param bool return_not_write: Return the combined dataset rather than writing it as a NetCDF file. :return: If return_not_write is True, a dictionary containing the scale factor names as keys and xarray DataSets with the combined scaling factors as values. :rtype: dict .. py:function:: combineHemcoDiag(ensemble_dir,output_dir,timeperiod=None) Combine HEMCO Diagnostics (e.g. emissions) from across the ensemble and save them as a single NetCDF, with a new dimension called "Ensemble" representing ensemble number. :param str ensemble_dir: Path to CHEEREIO ensemble directory. :param str output_dir: Path to where the combined HEMCO diagnostic NetCDF should be saved. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. .. py:function:: combineHemcoDiagControl(ensemble_dir,output_dir,timeperiod=None) Combine HEMCO Diagnostics (e.g. emissions) from the control run as a single NetCDF. :param str ensemble_dir: Path to CHEEREIO ensemble directory. :param str output_dir: Path to where the combined HEMCO diagnostic NetCDF should be saved. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. .. py:function:: makeDatasetForDirectory(hist_dir,species_names,timeperiod=None,hourlysub = 6,subset_rule = 'SURFACE', fullpath_output_name = None) Combine GEOS-Chem species concentration output from a single ensemble member as a single dataset and either save to a NetCDF file or return. :param str hist_dir: Path to GEOS-Chem output directory. :param list species_names: List of species that we would like to process. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param int hourlysub: Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours. :param str subset_rule: Which vertical level(s) to save data from. SURFACE is the surface, 850 is the 850hPa level, and ALL is all vertical data. :param str fullpath_output_name: Path and filename of the NetCDF file to which we should save the combined data. If ``None``, return the data instead :return: If fullpath_output_name is ``None``, an xarray DataSet with the combined concentrations. :rtype: DataSet .. py:function:: makeDatasetForEnsemble(ensemble_dir,species_names,timeperiod=None,hourlysub = 6,subset_rule = 'SURFACE', fullpath_output_name = None) Combine GEOS-Chem species concentration output from across the ensemble as a single dataset, with a new Ensemble dimension denoting ensemble member number, and either save to a NetCDF file or return. :param str ensemble_dir: Path to CHEEREIO ensemble directory. :param list species_names: List of species that we would like to process. :param list timeperiod: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param int hourlysub: Only grab files whose hour timestamp is divisible by this number; i.e. 6 means that we grab data every six hours. :param str subset_rule: Which vertical level(s) to save data from. SURFACE is the surface, 850 is the 850hPa level, and ALL is all vertical data. :param str fullpath_output_name: Path and filename of the NetCDF file to which we should save the combined data. If ``None``, return the data instead :return: If fullpath_output_name is ``None``, an xarray DataSet with the combined concentrations. :rtype: DataSet .. py:function:: makeYEachAssimPeriod(path_to_bigy_subsets,startdate=None,enddate=None,fullpath_output_name = None) Combine the intermediate Y datasets, as output by :ref:`HIST Ensemble`, into a dictionary for the entire ensemble run. These Y datasets contain simulated observations from GEOS-Chem aligned with actual observations, along with other metadata. Returns a combined dictionary, where the keys are timestamps. :param str path_to_bigy_subsets: Path to where the intermediate values of Y are saved out by the ensemble (postprocess/bigy in your ensemble installation). :param datetime startdate: Start date for when we should process Y datasets. :param datetime enddate: End date for when we should process Y datasets. :param str fullpath_output_name: Path and filename of the pickle file to which we should save the combined Y dataset. If ``None``, return the data instead. :return: If fullpath_output_name is ``None``, an dictionary with the combined Y datasets. The keys are timestamps and the values are the bigY data dictionary for the given day -- entries in these include 'Latitude', 'Longitude', 'Observations', and the species name. See :ref:`HIST Ensemble` for details. :rtype: dict .. py:function:: plotSurfaceCell(ds,species_name,latind,lonind,outfile=None,unit='ppt',includesNature=False) Plot a timeseries of the surface concentrations for a single grid cell. :param DataSet ds: DataSet of the combined ensemble species concentrations, output by :py:func:`makeDatasetForEnsemble`. :param list species_name: Species to plot. :param int latind: Index of latitude of cell we will plot. :param int lonind: Index of longitude of cell we will plot. :param str outfile: Name of image file containing plot. If None, display the plot instead. :param str unit: either 'ppm', 'ppb', or 'ppt'; CHEEREIO will multiply the GEOS-Chem mole-mole ratio to get these units. :param bool includesNature: True or False, include the no-assimilation control run in plot. .. py:function:: plotSurfaceMean(ds,species_name,outfile=None,unit='ppt',includesNature=False) Plot a timeseries of the average surface concentrations across the domain. :param DataSet ds: DataSet of the combined ensemble species concentrations, output by :py:func:`makeDatasetForEnsemble`. :param list species_name: Species to plot. :param str outfile: Name of image file containing plot. If None, display the plot instead. :param str unit: either 'ppm', 'ppb', or 'ppt'; CHEEREIO will multiply the GEOS-Chem mole-mole ratio to get these units. :param bool includesNature: True or False, include the no-assimilation control run in plot. .. py:function:: tsPlotTotalEmissions(ds_ensemble,ds_prior,collectionName,useLognormal = False, timeslice=None,outfile=None) Plot a timeseries of a HEMCO diagnostic of interest and compare it to control (no assimilation). :param DataSet ds_ensemble: DataSet of the combined ensemble HEMCO diagnostics, output by :py:func:`combineHemcoDiag`. :param DataSet ds_prior: DataSet of the control HEMCO diagnostics, output by :py:func:`combineHemcoDiagControl`. :param list collectionName: Name of HEMCO diagnostic collection to plot. :param bool useLognormal: True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by ``lognormalErrors`` in ``ens_config.json``. :param list timeslice: A list of two datetime objects indicating the start and end of the time period of interest. Leave as None to request the entire time period. :param str outfile: Name of image file containing plot. If None, display the plot instead. .. py:function:: tsPlotSatCompare(bigY,species,numens,unit='ppb',observer_name='Observations',useControl=False,outfile=None) Plot a timeseries of simulated observations from the ensemble against the true observations, along with the control (no assimilation) simulation if requested. :param dict bigY: Dictionary containing the Y dictionaries across the entire ensemble. Output from :py:func:`makeYEachAssimPeriod`. :param str species: Species to plot :param int numens: Number of ensemble members. :param str unit: Unit to use on the plot label; usually stored in the ``OBSERVATION_UNITS`` entry in ``ens_config.json``. :param str observer_name: Name of the observer to use on the plot legend; usually stored in the ``OBS_TYPE`` entry in ``ens_config.json``. :param bool useControl: True or False, should we include the control simulation as a separate line on the plot? :param str outfile: Name of image file containing plot. If None, display the plot instead. .. py:function:: tsPlot(time,ensmean,enssd,species_name,unit,nature=None,priortime=None,prior=None,outfile=None) Utility method to plot a timeseries of ensemble data, with dotted standard deviation, along with control the (no assimilation) simulation data if requested. :param array time: Array of time values to plot on the x axis. :param array ensmean: Array of the ensemble mean value of the data of interest. :param array enssd: Array of the ensemble standard deviation value of the data of interest. :param str species_name: Name of species for plot title :param str unit: Unit to use on the plot label. :param array nature: Optional array of values, matching ``ensmean`` in time, which will be labeled 'Nature' on the plot. :param array priortime: Optional array of time values for the prior (no assimilation), which will be labeled 'Prior' on the plot. :param array prior: Optional array of values for the prior, corresponding with ``priortime`` in time, which will be labeled 'Prior' on the plot. :param str outfile: Name of image file containing plot. If None, display the plot instead. .. py:function:: makeBigYArrays(bigy,gclat,gclon,nEnsemble,postprocess_save_albedo=False,useControl=False) Take the aggregated big Y data dictionary, as output by :py:func:`makeYEachAssimPeriod`, and processes it into a dictionary containing gridded NumPy arrays. The gridded NumPy arrays are used for plotting. :param dict bigY: Dictionary containing the Y dictionaries across the entire ensemble. Output from :py:func:`makeYEachAssimPeriod`. :param array gclat: Latitude from GEOS-Chem grid. :param array gclon: Longitude from GEOS-Chem grid. :param int nEnsemble: Number of ensemble members. :param bool postprocess_save_albedo: True or False, should we process albedo data from TROPOMI CH4? :param bool useControl: True or False, should we process simulated observation data from the control run? :return: A dictionary with keys "obscount", containing the total number of observations; "obscount_avg", containing the total number of observations after averaging to the GEOS-Chem grid; "obs", containing actual observations; "sim_obs", containing simulated observations from the GEOS-Chem ensemble mean; "control", containing simulated observations from the GEOS-Chem control run if requested; and a list of dates under "dates" and species under "species". The NumPy arrays have dimension (date, species, lat, lon), indexed by the "dates" dictionary entry, "species" dictionary entry, gclat, and gclon arrays respectively. :rtype: dict Functions in map_tools ~~~~~~~~~~~~~ Here is the documentation for the postprocessing mapmaking toolkit, present in the ``map_tools.py`` file in the ``postprocessing`` folder. .. py:function:: plotMap(m,lat,lon,flat,labelname,outfile,clim=None,cmap=None,useLog=False,minval = None) Utility method to plot a 2D dataset on a map. :param Basemap m: Basemap object for plotting. :param array lat: Latitude values. :param array lon: Longitude values. :param array flat: 2D array of data to plot. :param str labelname: Name of label for color bar. :param str outfile: Name of image file containing plot. :param list clim: List of the minimum and maximum values for the colorbar. If None, use the maximum and minimum values of the dataset. :param colormap cmap: Colormap to use for plot; defaults to jet if none supplied. :param bool useLog: True or False, use a log scale for the colorbar. :param float minval: Minimum value to include in the plot; below this value set to nan. No minimum if ``None``. .. py:function:: plotEmissions(m,lat,lon,ppdir, hemco_diags_to_process,plotWithLogScale=True, min_emis=None,min_emis_std=None, plotcontrol=True,useLognormal = False, aggToMonthly=True) Plot emissions, listed in ``hemco_diags_to_process``, on a map and write to file. Also plot ensemble standard deviations and control emissions. :param Basemap m: Basemap object for plotting. :param array lat: Latitude values. :param array lon: Longitude values. :param str ppdir: Path to postprocess directory. This function will read in the data files output from :py:func:`combineHemcoDiag` and :py:func:`combineHemcoDiagControl` and save the files here. :param list hemco_diags_to_process: List of hemco diagnostics to plot. Usually given by ``hemco_diags_to_process`` in ``ens_config.json``. :param bool plotWithLogScale: True or False, use a log scale for the colorbar. :param float min_emis: Minimum emissions value to include in the plot; below this value set to nan. No minimum if ``None``. :param float min_emis_std: Minimum emissions standard deviation value to include in the plots of emissions standard deviations; below this value set to nan. No minimum if ``None``. :param bool plotcontrol: True or False, should we plot the control emission maps? :param bool useLognormal: True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by ``lognormalErrors`` in ``ens_config.json``. :param bool aggToMonthly: True or False, should we average emissions data to monthly resolution when we plot? Uses :py:func:`agg_to_monthly` to aggregate. .. py:function:: plotScaleFactor(m,lat,lon,ppdir, useLognormal = False, aggToMonthly=True) Plot emissions scaling factors on a map and write to file. :param Basemap m: Basemap object for plotting. :param array lat: Latitude values. :param array lon: Longitude values. :param str ppdir: Path to postprocess directory. This function will read in the data files output from :py:func:`combineScaleFactors` and save the files here. :param bool useLognormal: True or False, are we using lognormal errors for emissions? This affects how the averaging is done. Usually supplied by ``lognormalErrors`` in ``ens_config.json``. :param bool aggToMonthly: True or False, should we average emissions data to monthly resolution when we plot? Uses :py:func:`agg_to_monthly` to aggregate. .. py:function:: agg_to_monthly(dates, to_agg) Aggregate a NumPy array to monthly resolution and return. :param array dates: An array of dates, which are used to aggregate to monthly resolution. :param array to_agg: A three or four dimensional array to aggregate to monthly resolution. If three-dimensional, we assume time is the 1st entry; if four-dimensional, we assume time is the second entry. :return: A list with (1) the monthly dates, and (2) the aggregated array now at monthly resolution. :rtype: list .. _New field in postprocessing: Adding a new observation field to the postprocessing workflow ------------- For many observations, there are quantities other than the observation value itself that are relevant in the data assimilation problem. For example, albedo can impact remote sensing retrievals; users might want to automatically plot/store albedo values to see if they correlate with values of concern. Fortunately for you, CHEEREIO can handle any of these fields automatically without any user code! In the course of developing your Observation Operator, you might stored fields in an ObsData object other than observations (see :ref:`ObsData` for a description of this class). For example, the ObsPack operator stores fields such as altitude. These data are added to ObsData objects using the ``addData`` function with custom keys. Here is a relevant line of code from the ObsPack operator: :: toreturn.addData(utc_conv=ObsPack['utc_conv'],altitude=ObsPack['altitude'],pressure=GC['pressure'].values,obspack_id=ObsPack['obspack_id'],platform=ObsPack['platform'],site_code=ObsPack['site_code']) Any of these fields are automatically available to the postprocessing workflow. No code is required. The postprocessing routine will automatically try and plot any of these fields that the user lists under ``EXTRA_OBSDATA_FIELDS_TO_SAVE_TO_BIG_Y`` and ``EXTRA_OBSDATA_FIELDS_TO_REGRID_AND_PLOT`` in the ``ens_config.json`` file. See the :ref:`postprocessing settings` page for details on how these settings work. Again, there is no code required on your part!