Dear All,
in this post I'd like to talk about the work we have done for the
LaMMA consortium.
The Problem
The purpose of this project is to build a complete Spatial Data Infrastructure (SDI) that provides a facility for processing, publishing and interactively visualising spatio-temporal raster data. This platform is a candidate to replace the current one, which was also built on Open Source software but was rather static and exposed no OGC services.
The data that will be ingested into the system is generated by an existing processing infrastructure which produces a set of different MetOc models. Our goal is to manage the geophysical parameters (or variables) produced by the following models:
- ARW ECM
    - 3 Km resolution
    - 9 Km resolution
- GFS
Ingestion starts every day at noon and midnight, so each model at a given resolution has 2 run-times a day. The data produced by each run covers a different range of forecast times:
- ARW ECM (3 days with an interval of 1h)
- GFS (8 days with an interval of 6h)
The data is produced in GRIB format (version 1).
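To make the schedule above concrete, here is a minimal Python sketch that lists the forecast times covered by a single run. The horizons and intervals come from the list above; the function name, the `MODELS` table and the choice to include both the run time itself and the last instant of the horizon are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Forecast horizon and interval per model, as listed above.
MODELS = {
    "ARW ECM": (timedelta(days=3), timedelta(hours=1)),
    "GFS": (timedelta(days=8), timedelta(hours=6)),
}


def forecast_times(run_time, model):
    """Return the forecast instants of one run.

    Assumption: the instants go from the run time up to and
    including run time + horizon.
    """
    horizon, step = MODELS[model]
    steps = horizon // step  # timedelta // timedelta gives an int
    return [run_time + i * step for i in range(steps + 1)]


# Example: the noon run of a hypothetical day.
noon_run = datetime(2012, 1, 1, 12, tzinfo=timezone.utc)
gfs_times = forecast_times(noon_run, "GFS")
```

Under these assumptions, each layer published for a run carries one time value per instant returned by `forecast_times`.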
Our Solution
Leveraging the OpenSDI suite, and specifically its GeoServer, GeoNetwork and GeoBatch components, as well as other well-known Open Source projects such as Apache Tomcat, Apache HTTP Server and Postgres, we provided an extensible, standards-based platform to automatically ingest and publish data.
The infrastructure we have put together is depicted in the deployment diagram below.
Figure: Deploy diagram
This infrastructure was designed from the beginning to scale to a large number of external users: it is based on a GeoServer Master/Slave setup in which multiple slaves can be installed for higher throughput. Caching will be tackled in a later phase.
As you can see, we provided three access levels for different types of users:
- Admin can access the entire infrastructure locally and add GeoServer instances to the cluster to improve performance
- Poweruser can remotely add files for ingestion and administer GeoBatch via Basic Authentication
- User can look at the ingested data by accessing one of the GeoServer slave machines through the Apache httpd proxy server. The load of these accesses is distributed across all available slaves.
As mentioned above, the main building blocks are as follows:
- GeoServer for providing WMS, WCS and WFS services with support for the TIME and Elevation dimensions
- GeoNetwork, for publishing metadata for all data with specific customizations for managing the TIME dimensions in the dataset
- GeoBatch, to perform preprocessing and ingestion in near real time of data and related metadata with minimal human intervention
Using GeoBatch for ingestion and data preprocessing
In the LaMMA project the GeoBatch framework is used to preprocess and ingest the incoming GRIB files, as well as to remove data based on a sliding temporal window (currently set to 7 days), since it was a design decision to keep only the last 7 days of forecasts available for live serving.
Below you can find a diagram depicting one of the automatic ingestion flows we created for the LaMMA project using the GeoBatch framework.
Figure: GeoBatch ingestion flow example
The various building blocks comprising this flow are explained below:
- NetCDF2GeotiffAction reads the incoming GRIB file and produces a proper set of GeoTiff files, performing on-the-fly tiling, pyramiding and unit conversions. Each GeoTiff represents a 2D slice of one of the original 4D cubes contained in the source GRIB file.
- ImageMosaicAction uses the GeoServer Manager library to create the ImageMosaic store and layer in the GeoServer Master. The created ImageMosaic contains the proper configuration to parse the Time and Elevation dimension values from the GeoTiff files in order to create 4D layers in GeoServer.
- XstreamAction takes an XML file and deserializes it to a Java object, which is passed to the next action.
- FreeMarkerAction produces a proper XML metadata file for publishing in GeoNetwork, using a pre-cooked template and the data model passed to it.
- GeoNetworkAction publishes the metadata on the target GeoNetwork.
- ReloadAction forces a reload on all the GeoServer slaves so that they pick up the changes made by the master instance.
This type of flow (with a slightly different setup) is used to convert and publish the 3 different incoming models.
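As an illustration of how Time and Elevation values can be recovered for each 2D slice, here is a minimal Python sketch that parses them from a file name. The naming scheme `<variable>_<elevation>_<yyyymmddThhmmssZ>.tif` is hypothetical and chosen only for this example; the names actually produced by the LaMMA flows and the regular expressions configured in the ImageMosaic are not shown in this post.

```python
import re
from datetime import datetime, timezone

# Hypothetical naming scheme, for illustration only:
#   <variable>_<elevation>_<yyyymmddThhmmssZ>.tif
SLICE_NAME = re.compile(
    r"^(?P<variable>[A-Za-z0-9-]+)"
    r"_(?P<elevation>\d+(?:\.\d+)?)"
    r"_(?P<time>\d{8}T\d{6}Z)\.tiff?$"
)


def parse_slice_name(file_name):
    """Return (variable, elevation, time) for one 2D GeoTiff slice."""
    match = SLICE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected slice name: {file_name!r}")
    time = datetime.strptime(match["time"], "%Y%m%dT%H%M%SZ")
    return (
        match["variable"],
        float(match["elevation"]),
        time.replace(tzinfo=timezone.utc),  # the trailing Z means UTC
    )


variable, elevation, time = parse_slice_name(
    "Temperature_850.0_20120101T120000Z.tif"
)
```

Encoding both dimensions in the file name is one simple way to let a mosaic index them without opening every file; other conventions would work equally well.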
The other type of flow is the remove flow, which is composed of the following building blocks:
- ScriptingAction executes a remove.groovy script which will:
    - calculate the oldest time to retain
    - select the older files to be removed
    - search for and remove the matching metadata from GeoNetwork
    - remove the collected layers and stores from the GeoServer Master catalog
    - permanently delete the successfully removed files
- ReloadAction forces a reload on all the GeoServer slaves.
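The first two steps of the remove script can be sketched as follows. This is not the actual remove.groovy script (which is written in Groovy and is not shown in this post), just a minimal Python sketch of a 7-day sliding window. It assumes that each file is associated with a single timestamp, and the function names and the file-to-timestamp mapping are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Sliding window size, as stated in the post.
RETENTION = timedelta(days=7)


def oldest_time_to_retain(now, retention=RETENTION):
    """Everything strictly older than this instant is removed."""
    return now - retention


def select_expired(files, now, retention=RETENTION):
    """Return the names of the files that fell out of the window.

    `files` maps a file name to its timestamp (hypothetical
    structure, for illustration only).
    """
    cutoff = oldest_time_to_retain(now, retention)
    return sorted(name for name, stamp in files.items() if stamp < cutoff)


now = datetime(2012, 1, 10, 12, tzinfo=timezone.utc)
files = {
    "old.tif": datetime(2012, 1, 1, 12, tzinfo=timezone.utc),
    "recent.tif": datetime(2012, 1, 9, 0, tzinfo=timezone.utc),
}
expired = select_expired(files, now)
```

The remaining steps (removing metadata, layers and stores, then deleting the files) would operate on the names returned by `select_expired`.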
Using GeoNetwork for metadata management
We have customized the metadata indexing (thanks Lucene!) in GeoNetwork in order to be able to index meteorological model runs in terms of their run time as well as their forecast times.
Generally speaking, the data we are dealing with is driven by a meteorological model which produces, every day, a certain number of geophysical parameters whose temporal validity spans a certain number of time instants (forecast times) in the future. In GeoNetwork we currently create a new metadata object for each geophysical parameter (e.g. Temperature) of a new model run; this metadata object contains one link to a WMS request for each forecast time, leveraging the TIME dimension in GeoServer (see picture below). Moreover, the forecast times themselves are indexed so that advanced searches can be done on them.
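To give an idea of what such per-forecast-time links can look like, here is a minimal Python sketch that builds a WMS 1.1.1 GetMap URL carrying the TIME parameter. The base URL, layer name, bounding box and image size are hypothetical placeholders, not the values used in the LaMMA deployment.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def getmap_url(base_url, layer, time,
               bbox=(-180, -90, 180, 90), size=(768, 384)):
    """Build a WMS 1.1.1 GetMap URL for one forecast time."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": size[0],
        "HEIGHT": size[1],
        "FORMAT": "image/png",
        # WMS expects ISO 8601 time values.
        "TIME": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name.
url = getmap_url(
    "http://example.org/geoserver/wms",
    "lamma:Temperature",
    datetime(2012, 1, 1, 12, tzinfo=timezone.utc),
)
```

One such URL per forecast time of a run would then be attached to the metadata object of the corresponding parameter.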
If you have questions about the work described in this post, or if you want to know more about how our services could help your organization reach its goals, do not hesitate to contact us.
The GeoSolutions team