Observations

Observations in CHEEREIO are handled by user-generated observation operators. Because users are most likely to need to modify CHEEREIO in order to introduce new observation types (such as a currently unsupported satellite product), special care has been taken to make it as easy as possible to add new observation operators without modifying existing CHEEREIO source code. In this article, we outline how observation operators are implemented in CHEEREIO and describe how users can create their own custom observation operators and plug them into the CHEEREIO workflow.

What is a CHEEREIO observation operator?

As described in the Technical overview of the 4D-LETKF implementation section, CHEEREIO works by combining a nested series of objects called Translators. Translators successively abstract raw observational data and GEOS-Chem output and convert them into the quantities needed for LETKF assimilation. Assimilation results are then passed back down through the translators and saved out in a form compatible with GEOS-Chem. CHEEREIO observation operators take (1) GEOS-Chem output data, supplied as a 4D NumPy array (time, level, latitude, longitude) from other CHEEREIO translators, and (2) observation data from file, and map the GEOS-Chem output onto observation space. This often looks like applying a satellite averaging kernel or air mass factors to create a vector of simulated observations, each corresponding to a real observation matched in time and space. This output is then passed to other parts of CHEEREIO for LETKF assimilation. The Workflow to add a new observation operator section walks through in detail how to write a CHEEREIO observation operator.

The observation operator toolkit

Technically speaking, CHEEREIO observation operators are objects that inherit from the Observation_Translator class. In object-oriented programming, inheritance can be thought of as a sophisticated form of templating. Indeed, the Observation_Translator class itself is mostly empty, and contains instructions to the user on how to write two standardized methods to (1) read observations from file and process them into a Python dictionary formatted for CHEEREIO, and (2) generate simulated observations \(y_i^b\) from GEOS-Chem output, returning the results as an ObsData object (see The ObsData class). Users can easily write their own class inheriting from Observation_Translator for a specific use case (like a particular surface or satellite instrument) by implementing these two methods, optionally employing the tools provided in the observation toolkit (detailed below) to make coding easier. The full instructions for adding a new observation operator are given in Workflow to add a new observation operator.

Any class written with this strict Observation_Translator template will then plug in automatically to the rest of the CHEEREIO workflow; it can then be activated from the main configuration file by any user, not just the original author. CHEEREIO also comes with some pre-written observation operators (such as for the TROPOMI and OMI satellite instruments). Many different observation operators can be used simultaneously, making it natural to perform multispecies data assimilation or assimilation using both surface and satellite data within the CHEEREIO framework. Again, because Observation_Translators handle the details of interpreting a specific observation type, the rest of CHEEREIO can remain ignorant of specifics and operate in a fully abstract environment that can be reused for all simulations.

In the entries below, we briefly describe the functions and classes supplied in the observation operator toolkit (observation_operators.py).

The produceSuperObservationFunction function

In CHEEREIO, users can opt to aggregate observations together to the GEOS-Chem grid, rather than ingesting observations individually even if several separate observations are made in a GEOS-Chem grid cell for a given point in time. Using “super-observations” is helpful because it reduces the complexity of the LETKF calculation and can help control ensemble behavior in areas where there are dense observations. The one tricky bit is handling error – errors are reduced when aggregating observations. That’s where the produceSuperObservationFunction function comes in.

produceSuperObservationFunction(fname)

Takes as input a string for the function name. Users supply this string in ens_config.json via the SUPER_OBSERVATION_FUNCTION entry. Returns a function as output which will then be used to calculate errors after super-observation aggregation.

Parameters

fname (str) – The name of the function. Currently supported values are “default”, “sqrt”, and “constant”. The details of these functions are described in the AV_TO_GC_GRID entry on the Configuring your simulation page.

Returns

The super observation function super_obs()

Return type

function

Raises

ValueError – if the function name is unrecognized

Users never call the super observation function that is output by the produceSuperObservationFunction function directly, but CHEEREIO does. Therefore it is important that the super observation function has a standardized call signature and output. Details on how to write your own function are given in the (7) [optional] Add a new super observation error function section.

super_obs(mean_error, num_obs, errorCorr=0, min_error=0[, transportError=0])

A function which takes the mean error for a set of observations, the number of observations that are averaged, and a few other parameters, and outputs the new error resulting from the aggregation (which is always less than or equal to the mean error input).

Parameters
  • mean_error (float) – The mean error for the observations which have been aggregated together. Like all parameters, CHEEREIO supplies this number, calculated based on user settings.

  • num_obs (int) – The number of observations which have been aggregated together.

  • errorCorr (float) – The correlation (between 0 and 1) between individual observations. The function must always have this argument in the call signature even if it is not used.

  • min_error (float) – The error floor for the super observation. Error will never fall below this number. The function must always have this argument in the call signature even if it is not used.

  • transportError (float) – Irreducible error attributable to model transport. Only some super observation functions use this quantity.

Returns

The reduced error which will be associated with the super observation in the LETKF calculation

Return type

float

The apply_filters function

CHEEREIO supports real-time filtering of input observations based on user settings; for example, removing observations with high albedo. The apply_filters() function is the best way to perform this filtering, because it is able to connect seamlessly with the rest of the CHEEREIO codebase. For more information, see (6) [optional] Add observation filters via an extension.

apply_filters(OBSDATA, filterinfo)

Takes raw observations as read from file and a dictionary-based description of how to filter out bad observations, and returns the filtered raw observations with bad data removed.

Parameters
  • OBSDATA (dict) – Raw observation data in dictionary form. The OBSDATA dictionary must have a NumPy array labeled “longitude”, one labeled “latitude” and one labeled “utctime”. The “utctime” entry must follow the ISO 8601 date and time format. ISO 8601 represents date and time by starting with the year, followed by the month, the day, the hour, the minutes, seconds and milliseconds. For example, 2020-07-10 15:00:00.000 represents the 10th of July 2020 at 3 p.m. Timezone is assumed to be UTC. For a good example of how to format OBSDATA, see the read_tropomi() or read_omi() functions in tropomi_tools and omi_tools respectively.

  • filterinfo (dict) – A dictionary describing the filters to be applied. The keys of the dictionary are called filter families and describe the observation type (e.g. OMI_NO2), while the value is a list with filter values specific to that observation type. For example, in the OMI_NO2 filter family, the solar zenith angle filter value is the first entry in the list.

Returns

The post-filtering version of the input dictionary OBSDATA.

Return type

dict
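As a rough illustration of the OBSDATA layout described above, the sketch below builds a small dictionary with the three required keys and removes entries with a boolean mask. The helper mask_obsdata, the albedo field, and the 0.3 threshold are hypothetical and are not part of CHEEREIO; real filtering should go through apply_filters().

```python
import numpy as np

def mask_obsdata(obsdata, keep):
    """Return a copy of obsdata keeping only entries where keep is True.

    Hypothetical helper for illustration; not part of CHEEREIO.
    """
    return {key: np.asarray(values)[keep] for key, values in obsdata.items()}

# Required keys: longitude, latitude, utctime (ISO 8601 style, assumed UTC).
OBSDATA = {
    "longitude": np.array([-71.1, 2.35, 139.7]),
    "latitude": np.array([42.4, 48.9, 35.7]),
    "utctime": np.array([
        "2020-07-10 15:00:00.000",
        "2020-07-10 15:05:00.000",
        "2020-07-10 15:10:00.000",
    ]),
    "albedo": np.array([0.05, 0.45, 0.10]),  # hypothetical extra field
}

# Drop observations whose (hypothetical) albedo exceeds 0.3.
filtered = mask_obsdata(OBSDATA, OBSDATA["albedo"] <= 0.3)
```

Every array in the dictionary is masked with the same boolean vector, so the fields stay aligned observation by observation after filtering.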

The nearest_loc function

CHEEREIO uses the nearest_loc() function to match observation data with the GEOS-Chem grid. The corresponding index lists are used to (1) get GEOS-Chem columns corresponding with observations, and (2) aggregate observation data to the GEOS-Chem grid.

nearest_loc(GC, OBSDATA)

Find the GEOS-Chem grid box and time indices which best correspond with real observation data.

Parameters
  • GC (DataSet) – An xarray dataset which contains the combined GEOS-Chem model output. GC is provided by other CHEEREIO translators; users creating new observation operators can take it as a given.

  • OBSDATA (dict) – Observation data in dictionary form. See apply_filters() for more details.

Returns

Three NumPy arrays iGC, jGC, and tGC containing the spatial and temporal indices on the GEOS-Chem grid which match observations.

Return type

List of NumPy arrays
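The idea of matching observations to grid boxes can be sketched with plain NumPy. This is not the nearest_loc() source; the helper nearest_index and the grid values are assumptions for illustration, and the sketch ignores longitude wraparound. Time indices could be matched in the same way after converting times to numbers.

```python
import numpy as np

def nearest_index(grid_values, obs_values):
    """For each observation, return the index of the closest grid value."""
    grid_values = np.asarray(grid_values)
    obs_values = np.asarray(obs_values)
    # Distance matrix of shape (n_obs, n_grid); argmin picks the closest grid point.
    return np.abs(obs_values[:, None] - grid_values[None, :]).argmin(axis=1)

# Hypothetical grid-box centers and three observations.
grid_lat = np.array([-4.0, 0.0, 4.0, 8.0])
grid_lon = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
obs_lat = np.array([0.9, 7.2, -3.1])
obs_lon = np.array([-9.0, 1.0, 4.0])

lat_idx = nearest_index(grid_lat, obs_lat)
lon_idx = nearest_index(grid_lon, obs_lon)
```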

The getGCCols function

Observation operators commonly use the getGCCols function to grab GEOS-Chem columns corresponding in space and time with observations. It returns data as a set of 2D arrays (with dimension index of observation, column) which can be used by subsequent observation operator functions.

getGCCols(GC, OBSDATA, species, spc_config, returninds=False, returnStateMet=False, GC_area=None)

Takes aggregated GEOS-Chem data and grabs columns corresponding with observations in space and time, stored as 2D arrays (with dimension index of observation, column). These GEOS-Chem columns are commonly used by observation operators for calculating simulated observations.

Parameters
  • GC (DataSet) – An xarray dataset which contains the combined GEOS-Chem model output. GC is provided by other CHEEREIO translators; users creating new observation operators can take it as a given.

  • OBSDATA (dict) – Observation data in dictionary form. See apply_filters() for more details.

  • species (str) – The species of interest, e.g. “CH4”

  • spc_config (dict) – The CHEEREIO ensemble configuration data, stored as a dictionary. This is provided by CHEEREIO.

  • returninds (bool) – True or False, should we return the three index vectors from nearest_loc() in addition to the GEOS-Chem columns.

  • returnStateMet (bool) – True or False, are we also going to subset meteorological data accompanying GEOS-Chem.

  • GC_area (array) – If we are using the grid cell areas, we supply them here (rare)

Returns

A dictionary containing (1) GC_SPC, a 2D array containing GEOS-Chem columns of the species of interest corresponding with observations; (2) GC_P, a 2D array with GEOS-Chem pressure levels; (3) additional entries depending on if returnStateMet is True and/or GC_area is supplied. Indices can also be returned as a list if returninds is True.

Return type

dict
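To show what "2D arrays with dimension index of observation, column" means in practice, here is a minimal NumPy sketch that pulls one vertical column per observation out of a 4D field. The array field and the index vectors are made up for illustration; this is not how getGCCols() is necessarily implemented.

```python
import numpy as np

# Hypothetical model field with dimensions (time, level, latitude, longitude).
n_time, n_lev, n_lat, n_lon = 2, 3, 4, 5
field = np.arange(n_time * n_lev * n_lat * n_lon, dtype=float).reshape(
    n_time, n_lev, n_lat, n_lon
)

# Hypothetical index vectors, one entry per observation.
t_idx = np.array([0, 1, 1])
lat_idx = np.array([1, 3, 0])
lon_idx = np.array([0, 2, 4])

# Advanced indexing on time, latitude, and longitude keeps the level axis,
# giving one column per observation: shape (n_obs, n_lev).
columns = field[t_idx, :, lat_idx, lon_idx]
```

Because the index arrays are separated by a slice, NumPy places the observation dimension first, so columns[k] is the full vertical column matched to observation k.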

The averageByGC function

CHEEREIO allows users to specify if they want observations to be aggregated to the GEOS-Chem grid (specified by setting AV_TO_GC_GRID to True in ens_config.json). The averageByGC() function ensures that observational data are aggregated onto the GEOS-Chem grid; it also returns results in an ObsData object, which is the expected return type for the gcCompare() function in an observation operator.

The averageByGC() function accepts two kinds of errors — prescribed errors (relative or absolute) or individual observational errors supplied with observation files.

averageByGC(iGC, jGC, tGC, GC, GCmappedtoobs, obsvals, doSuperObs, superObsFunction=None, other_fields_to_avg=None, prescribed_error=None, prescribed_error_type=None, obsInstrumentError=None, modelTransportError=None, errorCorr=None, minError=None)

Average observational data and other parameters onto the GEOS-Chem grid. Returns an ObsData object, which is compatible with the rest of the CHEEREIO workflow (i.e. expected output of gcCompare()).

Parameters
  • iGC (array) – Index array output by nearest_loc()

  • jGC (array) – Index array output by nearest_loc()

  • tGC (array) – Index array output by nearest_loc()

  • GC (DataSet) – An xarray dataset which contains the combined GEOS-Chem model output. GC is provided by other CHEEREIO translators; users creating new observation operators can take it as a given.

  • GCmappedtoobs (array) – An array of simulated observations, output from an observation operator, of the same length as the observation data.

  • obsvals (array) – An array of observation data.

  • doSuperObs (bool) – True or False, should we reduce error when we aggregate observations together. Supplied by user settings.

  • superObsFunction (str) – Name of the super observation function, fed into produceSuperObservationFunction(). Supplied by user settings.

  • other_fields_to_avg (dict) – Other fields present in the ObsData object can be averaged to the GC grid (e.g. albedo from TROPOMI). Users wishing to do this can supply a dictionary with keys naming the fields in question (must be present in ObsData) and values storing an array of these fields. Note that CHEEREIO automatically will calculate and provide all fields provided in the EXTRA_OBSDATA_FIELDS_TO_SAVE_TO_BIG_Y setting in ens_config.json by default without any user code or input required (including for new observation operators).

  • albedo_nir (array) – As with albedo_swir, but for near wave infrared albedo.

  • blended_albedo (array) – As with albedo_swir, but for blended albedo.

  • prescribed_error (float) – If working with prescribed errors, then this is either the percent or absolute error associated with the observational data.

  • prescribed_error_type (str) – If using prescribed errors, a string for “relative” or “absolute” denoting how prescribed_error should be interpreted (as percent or an absolute value). Supplied by user configuration settings.

  • obsInstrumentError (array) – If working with observational errors, an array of errors associated with each observation.

  • modelTransportError (float) – If using a super-observation function that accounts for model transport error, the transport error. Supplied by the user configuration settings.

  • errorCorr (array) – If using a super-observation function that accounts for correlation between errors, the error correlation. Supplied by the user configuration settings.

  • minError (array) – If using a super-observation function that accounts for minimum error, the minimum error allowed for a specific observation. Supplied by the user configuration settings.

Returns

An ObsData object containing the aggregated observations.

Return type

ObsData

Raises

ValueError – if the error information is specified incorrectly.
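The aggregation step itself can be sketched as a group-by over the index triples returned by nearest_loc(). The helper average_by_cell below is hypothetical and far simpler than averageByGC(): it handles no errors and no extra fields, and only shows how observations sharing a grid cell and time index collapse into one mean value and a count.

```python
import numpy as np

def average_by_cell(i_idx, j_idx, t_idx, values):
    """Average values sharing the same (i, j, t) index triple.

    Hypothetical helper for illustration; not part of CHEEREIO.
    Returns the unique index triples, the per-cell means, and the counts.
    """
    keys = np.stack([i_idx, j_idx, t_idx], axis=1)
    cells, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()  # guard against NumPy versions returning a 2D inverse
    sums = np.bincount(inverse, weights=values, minlength=len(cells))
    return cells, sums / counts, counts

i_idx = np.array([2, 2, 5, 2])
j_idx = np.array([1, 1, 3, 1])
t_idx = np.array([0, 0, 0, 1])
obs = np.array([10.0, 14.0, 7.0, 3.0])

cells, means, counts = average_by_cell(i_idx, j_idx, t_idx, obs)
```

The counts are the kind of quantity a super observation function would receive as num_obs when reducing the aggregated error.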

The ObsData class

The ObsData class is a simple data storage class, used by CHEEREIO to handle data output from observation operators. It operates very similarly to a dictionary, but forces expected data to be present and allows users to flexibly add additional data (like albedo) which is present for some but not other operators.

class ObsData(gccol, obscol, obslat, obslon, obstime, **additional_data)

A simple class for storing labeled data output by observation operators.

Variables
  • gccol (array) – A NumPy array which contains GEOS-Chem model output mapped onto corresponding observations (i.e. passed through an observation operator). This is sometimes a 2D array containing all ensemble columns.

  • obscol (array) – An array containing observational data.

  • obslat (array) – An array containing latitude of observations.

  • obslon (array) – An array containing longitude of observations.

  • obstime (array) – An array containing times of observations.

  • additional_data (dict) – Additional data supplied to the constructor as keyword arguments are stored in dictionary form in this variable.

getGCCol()

Returns the gccol attribute

Returns

The gccol attribute of ObsData.

Return type

array

setGCCol(gccol)

Sets the gccol attribute to the input argument

Parameters

gccol (array) – Set gccol attribute to this value.

getObsCol()

Returns the obscol attribute

Returns

The obscol attribute of ObsData.

Return type

array

setObsCol(obscol)

Sets the obscol attribute to the input argument

Parameters

obscol (array) – Set obscol attribute to this value.

getCols()

Returns the gccol and obscol attributes as a list

Returns

The gccol and obscol attributes of ObsData as a list in that order.

Return type

list

getLatLon()

Returns the obslat and obslon attributes as a list

Returns

The obslat and obslon attributes of ObsData as a list in that order.

Return type

list

getObsTime()

Returns the obstime attribute

Returns

The obstime attribute of ObsData.

Return type

array

addData(**data_to_add)

Add custom data fields to ObsData

Parameters

**data_to_add – Add data by keyword argument to the additional_data dictionary attribute as key-value pairs.

getDataByKey(key)

Get custom data fields from ObsData

Parameters

key (str,list) – Get data from the additional_data dictionary attribute by key. If supplying a list of keys, it will return a list of values corresponding to each key.

Returns

The array requested by key, or a list of arrays requested by key, from the additional_data dictionary attribute.

Return type

array,list
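The accessor behavior documented above can be tried out with a small stand-in class. ObsDataSketch is NOT the CHEEREIO implementation; it is an assumption-laden mock that mirrors only the documented constructor, getCols(), getLatLon(), addData(), and getDataByKey(), so that the calling pattern can be seen end to end.

```python
import numpy as np

class ObsDataSketch:
    """Minimal stand-in mirroring the interface documented above.

    Not the CHEEREIO source; for illustration only.
    """

    def __init__(self, gccol, obscol, obslat, obslon, obstime, **additional_data):
        self.gccol = gccol
        self.obscol = obscol
        self.obslat = obslat
        self.obslon = obslon
        self.obstime = obstime
        # Extra keyword arguments are kept in a dictionary.
        self.additional_data = additional_data

    def getCols(self):
        return [self.gccol, self.obscol]

    def getLatLon(self):
        return [self.obslat, self.obslon]

    def addData(self, **data_to_add):
        self.additional_data.update(data_to_add)

    def getDataByKey(self, key):
        # A list of keys returns a list of values, in the same order.
        if isinstance(key, list):
            return [self.additional_data[k] for k in key]
        return self.additional_data[key]

obs = ObsDataSketch(
    gccol=np.array([1.0, 2.0]),
    obscol=np.array([1.1, 1.9]),
    obslat=np.array([42.4, 48.9]),
    obslon=np.array([-71.1, 2.35]),
    obstime=np.array(["2020-07-10 15:00:00.000", "2020-07-10 15:05:00.000"]),
    albedo=np.array([0.05, 0.10]),  # hypothetical extra field
)
obs.addData(num_av=np.array([3, 1]))  # hypothetical field name
albedo, num_av = obs.getDataByKey(["albedo", "num_av"])
```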

The Observation_Translator class

All observation operators that are compatible with CHEEREIO must inherit from the Observation_Translator class. This class itself is basically empty and functions as a template for users to follow in building their own observation operators.

class Observation_Translator(verbose)

All observation operators compatible with CHEEREIO inherit from this abstract class and implement the two required methods getObservations() and gcCompare().

Attr int verbose

Verbosity of output; 1 is default, higher values print more statements during calculation.

Attr dict spc_config

Ensemble configuration data, from ens_config.json.

Attr str scratch

Path to the ensemble scratch folder.

initialReadDate()

This is an optional function that observation operators can implement, and as such is not present in the abstract class. For a sorted list of all the observation files, indicate in a dictionary the start and end datetimes of the data in each file and save as a pickle file into the scratch/ directory. If an observation operator implements this function, then it should be indicated in operators.json (see (4) Update operators.json).

Returns

Start and end dates for each observation data file used in assimilation.

Return type

dict

getObservations(specieskey, timeperiod, interval=None, includeObsError=False)

For a given species and time period of interest, provide a dictionary with all relevant observations and observation metadata (latitude, longitude, time, and others as user needs).

Parameters
  • specieskey (str) – The species of interest. Please note that the “specieskey” variable MUST be a key in the dictionary OBSERVED_SPECIES in ens_config.

  • timeperiod (list) – A list of two datetime objects indicating the start and end times of the observations which we need to aggregate.

  • interval (int) – If this value is not None, only read observations at a given interval (e.g. only a certain hour a day). This is only used to speed up certain kinds of plots; while implementations must accept this argument, they can opt to raise an error if interval is not None rather than implement this functionality.

  • includeObsError (bool) – True or False, should we load in error data which accompanies observation data? If your data set does not have error data that is useful, you can ignore this or opt to raise an error if this is set to True.

Returns

Observation data formatted as a dictionary. The returned dictionary must have keys for “latitude”, “longitude”, and “utctime”, where UTC time is an ISO 8601 date time string. Actual observation data can be named however the user would like, so long as gcCompare() can handle it.

Return type

dict

Raises

NotImplementedError – if the user fails to implement this function.

gcCompare(specieskey, OBSDATA, GC, GC_area=None, saveAlbedo=False, doErrCalc=True, useObserverError=False, prescribed_error=None, prescribed_error_type=None, transportError=None, errorCorr=None, minError=None)

THE BIG DADDY. This function takes as input observation data (formatted as a dictionary) and GEOS-Chem data from the ensemble (an xarray DataSet), and returns as output an ObsData object with GEOS-Chem data mapped into observation space, along with relevant metadata. This is the heart and soul of the observation operator.

Parameters
  • specieskey (str) – The species of interest. Please note that the “specieskey” variable MUST be a key in the dictionary OBSERVED_SPECIES in ens_config.

  • OBSDATA (dict) – Observation data in dictionary form. See apply_filters() for more details.

  • GC (DataSet) – An xarray dataset which contains the combined GEOS-Chem model output. GC is provided by other CHEEREIO translators; users creating new observation operators can take it as a given.

  • GC_area (array) – If we are using the grid cell areas, we supply them here (rare)

  • saveAlbedo (bool) – True or False, should we save out albedo data into the ObsData object we return. Ignore if you do not use.

  • doErrCalc (bool) – In the case where AV_TO_GC_GRID is set to True, should we calculate the aggregated error and save it out in the returned ObsData object? This is always ignored if AV_TO_GC_GRID is set to False. THIS AND ALL SUBSEQUENT PARAMETERS ARE ONLY USED IF AV_TO_GC_GRID is set to True.

  • useObserverError (bool) – True or False, are we using error accompanying observation files or (if False) are we using prescribed (absolute/relative) errors.

  • prescribed_error (float) – If working with prescribed errors, then this is either the percent or absolute error associated with the observational data.

  • prescribed_error_type (str) – If using prescribed errors, a string for “relative” or “absolute” denoting how prescribed_error should be interpreted (as percent or an absolute value). Supplied by user configuration settings.

  • transportError (float) – If using a super-observation function that accounts for model transport error, the transport error. Supplied by the user configuration settings.

  • errorCorr (array) – If using a super-observation function that accounts for correlation between errors, the error correlation. Supplied by the user configuration settings.

  • minError (array) – If using a super-observation function that accounts for minimum error, the minimum error allowed for a specific observation. Supplied by the user configuration settings.

Returns

ObsData type object containing observation data, relevant metadata (lat/lon/time/etc), and GEOS-Chem data mapped via this function onto observation space.

Return type

ObsData

Raises

NotImplementedError – if the user fails to implement this function.

Workflow to add a new observation operator

Because of CHEEREIO’s modular design, adding a new observation operator for an arbitrary new observation type (like a new satellite instrument) is straightforward. Here we walk through the process to write a new CHEEREIO observation operator step-by-step. You can always look at existing tools for an additional model to follow (e.g. TROPOMI tools in core/tropomi_tools.py and OMI tools in core/omi_tools.py).

(1) Create a class inheriting from Observation_Translator

Create a new file in the core/ folder to contain your observation operator and whatever support methods you need to write. At a minimum, you will need to import the observation_operators module, because all observation operators compatible with CHEEREIO need to inherit from Observation_Translator.

Suppose we were writing an operator for surface NO2 monitors. Here’s how your new surface_no2_translator.py file might start:

import observation_operators as obsop

class Surface_NO2_Translator(obsop.Observation_Translator):
        def __init__(self,verbose=1):
                super().__init__(verbose)

The init function here is run when the Surface_NO2_Translator object is created. The super() function here just means that we run the default initialization included in the Observation_Translator class; unless you have a very good reason to do so, you should probably use this initialization.

(2) Implement getObservations() function

The getObservations() method takes as input (1) a string representing a key from OBSERVED_SPECIES in ens_config.json, and (2) a pair of datetime objects representing the start and end of the period of interest. This function should return a dictionary of all the relevant observations of the species of interest in this time period. The dictionary should have (1) the observation data stored as a 1D NumPy array, (2) required metadata fields of latitude, longitude, and UTC time of the observations, each stored as NumPy arrays within the dictionary (see Observation_Translator for details), and (3) additional metadata fields as necessary for your observation operator, such as albedo.

Suppose all of our observation data is in a CSV file. Our getObservations() function might look like this (pseudocode, may not be exactly right):

import observation_operators as obsop
import pandas as pd

class Surface_NO2_Translator(obsop.Observation_Translator):

        def __init__(self,verbose=1):
                super().__init__(verbose)

        def getObservations(self,specieskey,timeperiod, interval=None, includeObsError=False):

                #Get the name of the species we are observing
                species_of_interest = self.spc_config['OBSERVED_SPECIES'][specieskey]

                #Here we imagine the user specifies the csv file path in the ens_config.json file.
                data_path = self.spc_config['Surface_dirs'][species_of_interest]

                #In this implementation, the NO2 data must be named NO2.csv.
                data_file = f'{data_path}/NO2.csv'

                #Load the data, parsing the date column into datetimes
                data = pd.read_csv(data_file, parse_dates=['date'])

                #subset the data to the right timespan.
                data = data[(data['date']>=timeperiod[0]) & (data['date']<timeperiod[1])]

                #make an empty dictionary to return, fill it with NumPy arrays
                to_return = {}
                to_return['NO2'] = data['NO2'].to_numpy() #store data in the dictionary
                to_return['latitude'] = data['latitude'].to_numpy()
                to_return['longitude'] = data['longitude'].to_numpy()
                to_return['utctime'] = data['utctime'].to_numpy()

                #we only include the error associated with the measurement if requested.
                if includeObsError:
                        to_return['error'] = data['error'].to_numpy()

                return to_return #return the data

Note that this function is designed to (1) load data from file, (2) subset to the timeperiod of interest, and (3) return in a standardized dictionary form. Notice that the specific CSV file location is loaded from ens_config.json and stored in the self.spc_config object; you can always get user settings from ens_config.json via this object. Although the Surface_dirs is not in our current ens_config.json format, new observation operators will require us to grow the configuration file. See (5) Link observational files from ens_config.json for more details on how to add new setting fields to ens_config.json for your observation operator.
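The load, subset, and return pattern described above can be exercised on its own, without CHEEREIO, using an in-memory CSV. In this sketch the column names, the helper load_observations, and the sample rows are all assumptions for illustration; it also assumes a single utctime column holds the timestamps, which may differ from your files.

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical CSV contents; column names are assumptions for illustration.
csv_text = """utctime,latitude,longitude,NO2,error
2020-07-10 14:00:00.000,42.4,-71.1,12.5,1.1
2020-07-10 15:00:00.000,42.4,-71.1,13.0,1.2
2020-07-11 15:00:00.000,48.9,2.35,9.8,0.9
"""

def load_observations(csv_source, timeperiod, include_error=False):
    """Load a CSV, keep rows inside [start, end), and return a dictionary of arrays."""
    data = pd.read_csv(csv_source)
    # Parse timestamps only for the comparison; the strings are returned unchanged.
    times = pd.to_datetime(data["utctime"])
    in_window = (times >= timeperiod[0]) & (times < timeperiod[1])
    data = data[in_window]
    to_return = {
        "NO2": data["NO2"].to_numpy(),
        "latitude": data["latitude"].to_numpy(),
        "longitude": data["longitude"].to_numpy(),
        "utctime": data["utctime"].to_numpy(),
    }
    if include_error:
        to_return["error"] = data["error"].to_numpy()
    return to_return

window = [datetime(2020, 7, 10, 14, 30), datetime(2020, 7, 11)]
obs = load_observations(io.StringIO(csv_text), window, include_error=True)
```

Keeping the original timestamp strings in the returned dictionary, and parsing them only to build the time mask, preserves the ISO 8601 style text that the rest of the workflow expects.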

You may want to allow users to filter out bad observational data, using filters they can set via an extension. You should implement that filtering in this function, by following the (6) [optional] Add observation filters via an extension procedure.

(3) Implement gcCompare() function

The gcCompare() method takes as input (1) a string representing a key from OBSERVED_SPECIES in ens_config.json, (2) observational data in dictionary form, as output by getObservations(), (3) GEOS-Chem data as an xarray DataSet, and (4) additional input data, such as that specifying how to handle errors when aggregating to the GEOS-Chem grid. See Observation_Translator for details. We return data as an ObsData object. Here we show a very simplified version of a gcCompare function (pseudocode, may not be exactly right).

def gcCompare(self,specieskey,OBSDATA,GC,GC_area=None,saveAlbedo=False,doErrCalc=True,useObserverError=False, prescribed_error=None,prescribed_error_type=None,transportError = None, errorCorr = None,minError=None):

        #Get the name of the species we are observing
        species = self.spc_config['OBSERVED_SPECIES'][specieskey]

        #Use one of the utilities included in the Observation Operator toolkit to get GEOS-Chem data matching observations
        GC_col_data = obsop.getGCCols(GC,OBSDATA,species,self.spc_config,returnStateMet=False,GC_area=GC_area)

        #The GEOS-Chem 2D array, where the first dimension matches the length of observations and second is the column height.
        GC_SPC = GC_col_data['GC_SPC']

        #Get the surface data
        GC_surf = GC_SPC[:,0]

        toreturn = obsop.ObsData(GC_surf,OBSDATA[species],OBSDATA['latitude'],OBSDATA['longitude'],OBSDATA['utctime'])

        return toreturn

Here all we do is use the getGCCols() function to get GEOS-Chem columns that line up in space and time with our observational data. Then we take the surface level from each column. Finally, we create an ObsData object with all the relevant data and return it.

Note that a real observation operator will need to handle errors in the case that users choose to aggregate to the GEOS-Chem grid. Consult produceSuperObservationFunction(), and existing observation operator toolkits, for guidance on how to implement this.

(4) Update operators.json

Now that we have written our observation operator, we have to let CHEEREIO know about it! In the top level directory of CHEEREIO, there is a file called operators.json. Open it and add a new entry. In our Surface NO2 example, we would add the following:

"SURFACE_NO2" : {
        "module_name" : "surface_no2_translator",
        "translator_name" : "Surface_NO2_Translator",
        "implements_initialReadDate" : "False"
}

Here is the description of each of the three options:

module_name

File name of the Python script where the observation operator and support methods live, without the .py extension.

translator_name

Name of the class inheriting from Observation_Translator, where the main observation operator lives.

implements_initialReadDate

“True” or “False”, do you implement the initial read date function in your observation operator? This is fully optional.

Now that you have added your operator to operators.json, users can activate the observation operator by setting values in the key-value pairs within OBS_TYPE to the name of your operator; in this case, SURFACE_NO2. Here is an example of what ens_config.json could look like:

"OBS_TYPE" : {
        "NO2_OMI":"OMI",
        "NO2_MONITORS":"SURFACE_NO2"
},

In this example, we are using NO2 from OMI and from the SURFACE_NO2 operators. CHEEREIO will now use your operator to handle surface NO2!
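To make the roles of module_name and translator_name concrete, the sketch below resolves an operators.json-style entry with a dynamic import. How CHEEREIO resolves these entries internally is not shown in this article, so treat this as one plausible mechanism rather than the actual implementation. A standard-library module and class stand in for the translator so that the example runs on its own; the DEMO key and the helper load_operator_class are hypothetical.

```python
import importlib
import json

# Entry shaped like the operators.json example above; "collections" and
# "OrderedDict" are stand-ins for a real translator module and class.
operators = json.loads("""
{
    "DEMO" : {
        "module_name" : "collections",
        "translator_name" : "OrderedDict",
        "implements_initialReadDate" : "False"
    }
}
""")

def load_operator_class(entry):
    """Import entry['module_name'] and return the class named by entry['translator_name']."""
    module = importlib.import_module(entry["module_name"])
    return getattr(module, entry["translator_name"])

entry = operators["DEMO"]
operator_class = load_operator_class(entry)

# The flag is stored as a string, so compare against "True" explicitly.
has_initial_read_date = entry["implements_initialReadDate"] == "True"
```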

(6) [optional] Add observation filters via an extension

CHEEREIO supports real-time filtering of input observations based on user settings; for example, removing observations with high albedo. The apply_filters() function (documented at The apply_filters function) is the best way to perform this filtering, because it is able to connect seamlessly with the rest of the CHEEREIO codebase.

The apply_filters function takes (1) a dictionary of observation data OBSDATA formatted for the standard gcCompare functions in observation operators, and (2) a dictionary of filter information called filterinfo. Here we focus on filterinfo.

filterinfo is a dictionary. Keys are specific to a given observation type. By default, these are MAIN, OMI_NO2, and TROPOMI_CH4. Values are a list of filter values, distinct for each observation type. The apply_filters function checks to see if a given key is present in filterinfo; if it is, it parses the list of filter values accordingly. In the OMI NO2 case, it looks for solar zenith angle, cloud radiance fraction, and surface albedo and removes data that do not pass these filters.

Filter thresholds are usually supplied by ensemble configuration extensions, discussed here: Extensions. Users supply filter values in an extension, which observation operators then load and pass to the apply_filters function. In the case of TROPOMI CH4, this works as follows:

# Check first if extension is on before doing the TROPOMI filtering
if (self.spc_config['Extensions']['TROPOMI_CH4']=="True") and (self.spc_config['TROPOMI_CH4_FILTERS']=="True"):
        filterinfo["TROPOMI_CH4"] = [
                float(self.spc_config['TROPOMI_CH4_filter_blended_albedo']),
                float(self.spc_config['TROPOMI_CH4_filter_swir_albedo_low']),
                float(self.spc_config['TROPOMI_CH4_filter_swir_albedo_high']),
                float(self.spc_config['TROPOMI_CH4_filter_winter_lat']),
                float(self.spc_config['TROPOMI_CH4_filter_roughness']),
                float(self.spc_config['TROPOMI_CH4_filter_swir_aot'])
        ]

First, the operator checks that the extension and its filters are activated, then adds an entry to the filterinfo dictionary listing the thresholds supplied by the user. Follow this pattern to add your own filters.
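As a rough sketch of the same pattern applied to a new instrument, the snippet below builds a filterinfo entry from a configuration dictionary. The extension name MY_SENSOR, its configuration keys, and the helper build_filterinfo are all hypothetical; a plain dictionary stands in for self.spc_config. Note that apply_filters would also need to recognize the new key before the thresholds have any effect.

```python
def build_filterinfo(spc_config):
    """Collect filter thresholds for a hypothetical MY_SENSOR extension."""
    filterinfo = {}
    # Configuration values are strings, so compare against "True".
    extension_on = spc_config.get("Extensions", {}).get("MY_SENSOR") == "True"
    filters_on = spc_config.get("MY_SENSOR_FILTERS") == "True"
    if extension_on and filters_on:
        # Convert each user-supplied threshold from string to float.
        filterinfo["MY_SENSOR"] = [
            float(spc_config["MY_SENSOR_filter_albedo"]),
            float(spc_config["MY_SENSOR_filter_cloud_fraction"]),
        ]
    return filterinfo

# Stand-in for the merged ensemble configuration (all keys hypothetical).
config = {
    "Extensions": {"MY_SENSOR": "True"},
    "MY_SENSOR_FILTERS": "True",
    "MY_SENSOR_filter_albedo": "0.3",
    "MY_SENSOR_filter_cloud_fraction": "0.2",
}
info_on = build_filterinfo(config)
info_off = build_filterinfo({**config, "MY_SENSOR_FILTERS": "False"})
```

With filters switched off, the dictionary stays empty, so no MY_SENSOR filtering would be requested.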

(7) [optional] Add a new super observation error function

In CHEEREIO, users can opt to average observations to the GEOS-Chem grid by setting the AV_TO_GC_GRID entry to True. Your observation operator should consider supporting this functionality. “Super observations” are a useful technique for balancing prior and observational errors while also reducing the computational complexity of the optimization (by reducing the size of the observational vectors and matrices in the LETKF calculation). The main subtlety in super observation aggregation is the adjustment of observational error. By default, users can specify one of several error reduction functions in the SUPER_OBSERVATION_FUNCTION section:

sqrt

A modified version of the familiar square root law, where if we aggregate \(n\) observations (indexed by \(i\)) with errors \(\sigma_i\) together, the new error is \(\bar{\sigma}/\sqrt{n}\) where \(\bar{\sigma}\) is the mean of the \(\sigma_i\). The modification accounts for correlations \(c\) between errors (e.g. due to correlated retrieval errors from shared surface type or similar albedo), and for a user-specified minimum error \(\sigma_{\min}\). Thus the equation that is actually applied is given by \(\max\left[\left(\bar{\sigma}\cdot\sqrt{\frac{1-c}{n}+c}\right),\sigma_{\min}\right]\). The correlation \(c\) is taken from OBS_ERROR_SELF_CORRELATION with default value 0, and the minimum error \(\sigma_{\min}\) is taken from MIN_OBS_ERROR with default value 0 (i.e. the normal square root law).

default

As with “sqrt”, but with an additional term accounting for the fact that GEOS-Chem transport errors are perfectly correlated. Because perfectly correlated errors are irreducible no matter how many realizations are averaged, the resulting equation is given by \(\max\left[\sqrt{\bar{\sigma}^2\cdot\left(\frac{1-c}{n}+c\right)+\sigma_t^2},\sigma_{\min}\right]\) where \(\sigma_t\) is the transport error supplied by the “transport_error” entry of OTHER_OBS_ERROR_PARAMETERS.

constant

No error reduction applied. In other words, no matter how many observations are averaged, this function just returns \(\bar{\sigma}\).
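The three formulas above can be written out in NumPy as a quick numerical check. This is a sketch of the equations as stated in this section, not a copy of the CHEEREIO source; the function names are illustrative, and only the argument names mean_error, num_obs, errorCorr, min_error, and transport_error follow the documentation.

```python
import numpy as np

def sqrt_error(mean_error, num_obs, errorCorr=0.0, min_error=0.0):
    # max[ sigma_bar * sqrt((1 - c)/n + c), sigma_min ]
    reduced = mean_error * np.sqrt((1.0 - errorCorr) / num_obs + errorCorr)
    return np.maximum(reduced, min_error)

def default_error(mean_error, num_obs, errorCorr=0.0, min_error=0.0,
                  transport_error=0.0):
    # max[ sqrt(sigma_bar^2 * ((1 - c)/n + c) + sigma_t^2), sigma_min ]
    variance = mean_error**2 * ((1.0 - errorCorr) / num_obs + errorCorr)
    return np.maximum(np.sqrt(variance + transport_error**2), min_error)

def constant_error(mean_error, num_obs, errorCorr=0.0, min_error=0.0):
    # No reduction: return the mean error regardless of num_obs.
    return mean_error

mean_error = np.array([10.0, 10.0])   # mean error in each grid cell
num_obs = np.array([1, 4])            # observations averaged per cell

plain = sqrt_error(mean_error, num_obs)                    # uncorrelated case
correlated = sqrt_error(mean_error, num_obs, errorCorr=1.0)
with_transport = default_error(mean_error, num_obs, transport_error=12.0)
unchanged = constant_error(mean_error, num_obs)
```

With uncorrelated errors, averaging four observations halves the error; with fully correlated errors (errorCorr of 1), averaging gives no reduction at all, which is the same reasoning behind the irreducible transport term.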

To add a new super observation function, add an additional entry to produceSuperObservationFunction in the observation_operator.py file. This function is described in the The produceSuperObservationFunction function entry. At a minimum, all super observation functions need to accept the following four arguments: (1) mean_error, an array of individual observation error values generated by CHEEREIO; (2) num_obs, an array showing the number of observations in a given grid cell; (3) errorCorr, a float showing the correlation of observations to one another (default of zero); and (4) min_error, a float giving the minimum acceptable error in a grid cell (default of zero). You do not need to use all four inputs (see the constant function), but you do need to accept all four for compatibility. Additional inputs can be supplied from ens_config.json, as in the case of the default function and its need for transport_error. Users supply these additional arguments through the OTHER_OBS_ERROR_PARAMETERS parameter in ens_config.json; your function just accepts them according to their key. See LETKF settings for an example using transport_error.
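A custom function following this signature might look like the sketch below. The function inflated_sqrt and its extra parameter inflation_factor are hypothetical, as is the dictionary standing in for the user's OTHER_OBS_ERROR_PARAMETERS entry. How produceSuperObservationFunction selects and returns the function is not shown here; consult that function in observation_operator.py for the actual registration pattern.

```python
import numpy as np

def inflated_sqrt(mean_error, num_obs, errorCorr=0.0, min_error=0.0,
                  inflation_factor=1.0):
    """Hypothetical super observation error function.

    Accepts the four required arguments plus one extra keyword whose name
    matches an assumed key in OTHER_OBS_ERROR_PARAMETERS.
    """
    reduced = mean_error * np.sqrt((1.0 - errorCorr) / num_obs + errorCorr)
    # Inflate the reduced error, then enforce the minimum error.
    return np.maximum(inflation_factor * reduced, min_error)

# Stand-in for the extra parameters a user would set in ens_config.json.
other_params = {"inflation_factor": 1.5}

errors = inflated_sqrt(
    np.array([8.0, 8.0]),   # mean_error per grid cell
    np.array([4, 16]),      # num_obs per grid cell
    errorCorr=0.0,
    min_error=5.0,
    **other_params,         # extra arguments are matched by key
)
```

The second grid cell would fall to 3.0 after reduction and inflation, so the minimum error of 5.0 is returned there instead.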

(8) Share your operator with the CHEEREIO community

Once you are done with your observation operator and have tested it, make a pull request to the main CHEEREIO git repository. That way the community can make use of your tool and advance science faster!

Document your operator by adding an entry in the Existing observation operators section of the documentation. In your documentation, provide the paper that people should cite if they use your operator, so that you get appropriate credit for your work.