5. Open Data Node Modules

ODN-architecture-overview.png

Open Data Node consists of the following modules:

...

SearchPortal, which allows users to search the published data, is not described in this document because it is a separate application and not part of ODN.

 

5.1. Module ODN/UnifiedViews

Module ODN/UnifiedViews is an ETL & data enrichment tool.

...

The output of the module is extracted and transformed machine-readable data stored in ODN/Storage. Again, data is expected to be structured, tabular or linked data.

5.1.1. UnifiedViews - state of the art

Module ODN/UnifiedViews will be based on the UnifiedViews tool (https://github.com/UnifiedViews), an ETL framework with native support for transforming RDF data. UnifiedViews allows users to define, execute, monitor, debug, schedule, and share data transformation tasks.

...

The technical structure and licensing of UnifiedViews allow DPUs to be licensed not only as Open Source but also under a proprietary license. This is a planned feature of the tool, needed by use cases that require commercial exploitation. ODN will support the same commercial use cases.

5.1.1.1. UnifiedViews components and dependencies

The figure below depicts the current Maven modules in UnifiedViews and their dependencies. Modules in the yellow box are visible to DPU developers. The most important modules are:

  • frontend - Management GUI of UnifiedViews

  • backend - Engine running the data transformation tasks

  • commons-app - DAO & Services module, which is common to frontend and backend modules; it is used to store configuration for pipelines, DPUs, pipeline executions etc.

  • dataunit-rdf, dataunit-file - Modules with interfaces for data units; DPU developers writing new DPUs use these modules to read data from input data units and write data to output data units

uv-ComponentModel.png

 

5.1.2. Structure of the ODN/UnifiedViews and its context

odn-uv-structure.png

 

ODN/UnifiedViews comprises the following important components:

  • DAO & Services - used to access the database where the configuration of ETL tasks and their executions is stored (realized by module commons-app in Figure XX - from chapter 5.1.1.1.)

  • HTTP REST Transformation API - Services from DAO & Services layer exposed as HTTP REST methods. Used by ODN/Management module (this component is not realized by any module in Figure XX)

  • Data Processing Engine - Robust engine running the manually launched or scheduled transformation tasks - transformations may include data cleansing, linking, integration, quality assessment (realized by “backend” module in Figure XX)

  • Management GUI - GUI used to manage the configuration of pipelines, debugging executions, etc. (realized by “frontend” module in Figure XX)

 

5.1.3. Interaction with other modules

1. ODN/UnifiedViews loads the transformed data to ODN/Storage. Special DPUs (RDF data mart loader and Tabular data mart loader) must be provided to load the transformed data into the corresponding data store in ODN/Storage. The data must be stored there together with metadata, so that the ODN/Publication module knows which resources (tables, graphs) are associated with which pipeline/dataset.

...

  • show the pipeline detail in an expert mode (user may drag&drop DPUs, fine-tune pipeline configuration)

  • show the detailed results of pipeline executions (browse events/logs)

  • debug data being passed between DPUs

  • have access to advanced scheduling options

5.2. Module ODN/Storage

The purpose of this module is to store the transformed data produced by ODN/UnifiedViews. The ODN/Publication module uses ODN/Storage to get the transformed data so that it can be published, i.e. provided to data consumers.

 

5.2.1. Structure of the ODN/Storage and its context

odn-storage-structure.png

Two important components of ODN/Storage are:

  • RDBMS data mart

  • RDF data mart

5.2.1.1. RDBMS data mart

RDBMS data mart is a tabular data store. Data is stored there when the data publisher wants to prepare CSV dumps of the published dataset or provide a REST API for data consumers.

...

Note: Currently, UnifiedViews supports OpenLink Virtuoso (http://virtuoso.openlinksw.com/) as the only RDBMS implementation. As part of ODN, we will employ JDBC to add support for a wider range of databases. Testing and validation will be based on feedback from users (currently we also plan to work with PostgreSQL).

5.2.1.2. RDF data mart

Data is stored in the RDF data mart when the data publisher wants to prepare RDF dumps of the published dataset for data consumers or provide a SPARQL endpoint on top of the published dataset.

...

Note: Currently, UnifiedViews supports OpenLink Virtuoso (http://virtuoso.openlinksw.com/) and Sesame (http://www.openrdf.org/) as RDF data mart implementations. As part of ODN, we will employ the SAIL API to add support for a wider range of triplestores. Testing and validation will be based on feedback from users.

5.2.2. Interaction with other modules

1. Every transformation pipeline (ODN/UnifiedViews) can contain one or more RDF/RDBMS data mart loaders, i.e. DPUs which load the data resulting from the transformation pipeline into the corresponding data mart (RDF/RDBMS).

...

  • How many RDF graphs/tables are stored in the RDF/RDBMS data mart in total/for the given dataset ID?

  • How many RDF triples are stored in a given RDF graph in the RDF data mart?

  • How many records are in a given table in the RDBMS data mart?
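
Questions of this kind map onto simple aggregate queries. The following is a minimal sketch, not the ODN implementation: it assumes SQLite as a stand-in for the RDBMS data mart, uses a hypothetical table name and graph IRI, and only builds the SPARQL query text without sending it to a triplestore.

```python
import sqlite3


def count_triples_query(graph_iri: str) -> str:
    """Build a SPARQL query that counts the triples in one named graph."""
    return (
        "SELECT (COUNT(*) AS ?triples) "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"
    )


def count_records(conn: sqlite3.Connection, table: str) -> int:
    """Count the rows of one table in the (stand-in) RDBMS data mart."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    (total,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return total


# Usage with an in-memory database; "contracts" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER, supplier TEXT)")
conn.executemany("INSERT INTO contracts VALUES (?, ?)", [(1, "A"), (2, "B")])
print(count_records(conn, "contracts"))
print(count_triples_query("http://example.org/graph/contracts"))
```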

 

5.3. Module ODN/Publication

This module is responsible for publishing data via REST APIs, a SPARQL endpoint, or as data dumps in RDF or CSV formats. Published data has already been transformed as defined by the data transformation pipelines in ODN/UnifiedViews and stored in ODN/Storage.

...

Data administrators/publishers may also configure specific settings for each publication option, for example to tweak the dump generation process (such as which RDF serialization to use: Turtle, XML, etc.) or to select which resources (tables, graphs) associated with the transformed dataset (and stored in ODN/Storage) should be published, i.e. made available to data consumers.

 

5.3.1. Structure of the ODN/Publication and its context

odn-publication-structure.png

ODN/Publication comprises the following important components:

...

Note: As part of data publication, some metadata will be published by this module too (for example, “Last Modification Time” will be included in the appropriate HTTP header of the response). However, publication of metadata is mainly the responsibility of ODN/Catalog (see section 5.5).

5.3.2. File dumps

ODN/Publication module supports the creation of file dumps in CSV or RDF formats. Once a dataset is transformed, it is published. As part of publishing the transformed dataset, a CSV or RDF dump may be created; the dump is created only if the data publisher decides so.

...

Finally, a new entry is created in the Atom feed (http://en.wikipedia.org/wiki/Atom_(standard)) associated with the processed dataset; the feed points data consumers to the file(s) in the git repository where the published data and metadata are stored. The feed must be reachable from the dataset record in the ODN/Catalog module.
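
To illustrate what such a feed entry could look like, here is a minimal sketch that builds one Atom entry with the Python standard library. The entry ID, title and dump URL are hypothetical placeholders, and the choice of an "enclosure" link to carry the dump location is an assumption, not something fixed by ODN.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_atom_entry(entry_id: str, title: str, dump_url: str,
                     updated: datetime) -> ET.Element:
    """Create one Atom <entry> that points consumers to a published dump file."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # Atom expects an RFC 3339 timestamp in <updated>; normalize to UTC first.
    stamp = updated.astimezone(timezone.utc).isoformat()
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = stamp
    # Assumption: the dump location is carried in an "enclosure" link.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link",
                  {"rel": "enclosure", "href": dump_url})
    return entry


# Usage with hypothetical values.
ET.register_namespace("", ATOM_NS)
entry = build_atom_entry(
    "urn:example:dataset:contracts:2014-02-25",
    "Contracts dataset, new dump",
    "https://example.org/repo/contracts/dump.nt",
    datetime(2014, 2, 25, 12, 0, tzinfo=timezone.utc),
)
print(ET.tostring(entry, encoding="unicode"))
```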

5.3.2.1. RDF dumps

An RDF dump may be published only if the result of the dataset transformation is available in the RDF data mart in ODN/Storage.

To create the dump, ODN/Publication queries the RDF data mart with a SPARQL CONSTRUCT query to get the dump in the N-Triples (http://www.w3.org/TR/2014/REC-n-triples-20140225/) RDF serialization format. We use N-Triples because it is a line-oriented serialization format, which can be easily versioned by Git.
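
The sketch below shows the two pieces involved: a CONSTRUCT query selecting all triples of one named graph, and a normalization step for the resulting N-Triples text. Sorting and de-duplicating the lines is an assumption added here to make successive dumps diff cleanly in Git; the source does not state that ODN does this, and blank node labels may still differ between runs. The graph IRI is a hypothetical placeholder.

```python
def construct_query(graph_iri: str) -> str:
    """Build a SPARQL CONSTRUCT query returning every triple of one named graph."""
    return (
        "CONSTRUCT { ?s ?p ?o } "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"
    )


def stable_ntriples(ntriples: str) -> str:
    """Sort and de-duplicate N-Triples lines so that repeated dumps diff cleanly."""
    # One triple per line is what makes this purely line-based processing possible.
    lines = {line.strip() for line in ntriples.splitlines() if line.strip()}
    return "".join(line + "\n" for line in sorted(lines))


# Usage with a tiny hand-written dump.
raw = (
    "<http://example.org/b> <http://example.org/p> \"2\" .\n"
    "<http://example.org/a> <http://example.org/p> \"1\" .\n"
    "<http://example.org/b> <http://example.org/p> \"2\" .\n"
)
print(construct_query("http://example.org/graph/contracts"))
print(stable_ntriples(raw), end="")
```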

5.3.2.2. CSV dumps

A CSV dump may be published only if the result of the dataset transformation is available in the RDBMS data mart in ODN/Storage.

To create the dump, the ODN/Publication module exports the desired table in the RDBMS data mart as a CSV dump.
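
A table export of this kind can be sketched as follows. This is an illustration only: SQLite stands in for the RDBMS data mart, the table name is hypothetical, and the function name is not part of any ODN API.

```python
import csv
import io
import sqlite3


def export_table_as_csv(conn: sqlite3.Connection, table: str) -> str:
    """Export one table, with a header row, as CSV text."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # cursor.description holds one tuple per column; the first item is the column name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Usage with an in-memory database; "contracts" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER, supplier TEXT)")
conn.executemany("INSERT INTO contracts VALUES (?, ?)",
                 [(1, "Acme"), (2, "Smith, Ltd.")])
print(export_table_as_csv(conn, "contracts"), end="")
```

A real export would likely add an explicit ORDER BY so that successive dumps list rows in a stable order.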

5.3.3. SPARQL endpoint

ODN/Publication module supports publication of data via SPARQL endpoints. Once a dataset is transformed, it is published. As part of publishing the transformed dataset, data may be made available via a SPARQL endpoint, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDF data mart.

To make the data available via a SPARQL endpoint, the ODN/Publication module provides data consumers with a simple querying interface, where a data consumer may query the published data and associated metadata (obtained from ODN/InternalCatalog) using SPARQL queries. There is no versioning in this case; only the latest data is available via the SPARQL endpoint.
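
From the data consumer's side, querying such an endpoint typically means sending the query text in the query parameter defined by the SPARQL 1.1 Protocol. The sketch below only builds the request URL and performs no network access; the endpoint address is a hypothetical placeholder, since the actual ODN endpoint location is not given here.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL 1.1 Protocol GET request URL."""
    # The protocol carries the query text in the "query" parameter.
    return f"{endpoint}?{urlencode({'query': query})}"


# Usage with a hypothetical endpoint address.
url = sparql_get_url(
    "http://example.org/sparql",
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
)
print(url)
```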

5.3.4. REST API

ODN/Publication module supports the creation of REST APIs for data consumption. Once a dataset is transformed, it is published. As part of publishing the transformed dataset, a REST API may be generated for the published data, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDBMS data mart.

...

This method takes advantage of a lot of existing software and infrastructure, mainly the Git versioning tool and, for example, GitHub (or GitHub-like) repositories, and is most suitable for software developers and the subset of data analysts who already use such tools.

5.3.5. Interaction with other modules

1. ODN/Management initiates any publication process via Publication API of ODN/Publication. ODN/Publication module uses ODN/Storage to get the data which should be published.

...

4. Data consumers may (1) download CSV/RDF dumps, (2) use SPARQL endpoints, (3) use REST APIs.
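
The availability rules stated in sections 5.3.2 to 5.3.4 (RDF dumps and SPARQL endpoints need the RDF data mart; CSV dumps and REST APIs need the RDBMS data mart) can be summarized in a small lookup. This is only a sketch of those rules; the type and function names are hypothetical and not part of ODN.

```python
from enum import Enum


class DataMart(Enum):
    RDF = "rdf"
    RDBMS = "rdbms"


class PublicationOption(Enum):
    RDF_DUMP = "rdf_dump"
    CSV_DUMP = "csv_dump"
    SPARQL_ENDPOINT = "sparql_endpoint"
    REST_API = "rest_api"


# Which data mart must hold the transformation result for each option.
REQUIRED_DATA_MART = {
    PublicationOption.RDF_DUMP: DataMart.RDF,
    PublicationOption.SPARQL_ENDPOINT: DataMart.RDF,
    PublicationOption.CSV_DUMP: DataMart.RDBMS,
    PublicationOption.REST_API: DataMart.RDBMS,
}


def allowed_options(available_marts: set) -> set:
    """Return the publication options a publisher may enable for a dataset."""
    return {
        option
        for option, mart in REQUIRED_DATA_MART.items()
        if mart in available_marts
    }


# Usage: a dataset whose result was loaded only into the RDF data mart.
print(sorted(option.value for option in allowed_options({DataMart.RDF})))
```

Whether an allowed option is actually enabled remains the data publisher's decision.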

5.4. Module ODN/InternalCatalog

Before introducing the ODN/InternalCatalog module, the general concept of a data catalog is introduced.

5.4.1. Data Catalog

A data catalog holds metadata about each published dataset. It allows its users to browse/search the list of datasets and to see the metadata for every published dataset. A screenshot of a sample data catalog provided by data.gov.uk is shown below.

...

There are already available solutions which implement data catalog functionality, such as CKAN and DKAN.

5.4.1.1. Comparison of CKAN/DKAN

CKAN (http://ckan.org/features/) is a powerful data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available. Note: We may also consider Etalab (https://github.com/etalab), a fork of CKAN.

...

CKAN: extensions https://github.com/ckan/ckan/wiki/List-of-extensions

 

5.4.2. Data Catalog in ODN/InternalCatalog

Module ODN/InternalCatalog is the first module which encapsulates the functionality of a data catalog. The data catalog provided by the ODN/InternalCatalog module is used to manage datasets which should be transformed/published by ODN; it also allows data publishers to see details about the transformation/publishing process. It is an internal catalog: it is not visible to the public, and only the data publisher/data administrator can use it.

...

  • depict the data processing pipeline, which is associated with the transformed & published dataset

  • run data transformation/publishing from the catalog UI

  • provide brief information about the status of the dataset transformation

  • provide link to ODN/Publication module’s configuration dialog which configures how the dataset in the catalog is published

5.4.3. Interaction with other modules

ODN/InternalCatalog is used by ODN/Management to hold and present metadata about the datasets being transformed/published by the data publisher. On request, ODN/InternalCatalog publishes its internal catalog records about already published datasets to the ODN/Catalog module.

 

5.5. Module ODN/Catalog

ODN/Catalog is the second module which encapsulates the functionality of a data catalog. ODN/Catalog holds metadata about each dataset published by ODN. This data catalog is publicly visible. Its primary users are data consumers, who may browse/search the published datasets’ metadata; data consumers may also get a link to a dataset’s dump or API, so that they can consume the data in the dataset.

...

Module ODN/Catalog internally uses the same tool as ODN/InternalCatalog, i.e. DKAN, to provide the core data catalog functionality.

5.5.1. Interaction with other modules

This module is used by ODN/Management to create a new record or adjust an existing record in ODN/Catalog when a dataset is transformed by ODN/UnifiedViews and published by the ODN/Publication module. The record in ODN/Catalog is built from the metadata in ODN/InternalCatalog and from the information about the location of REST APIs, Atom feeds referring to data dumps, etc., provided by the ODN/Publication module.

 

5.6. Module ODN/Management

This module is responsible for managing the process of dataset transformation and publication. The diagram below shows the interaction of ODN modules when a dataset is published. It shows the case in which the dataset publication is launched manually; however, publication may also be scheduled by ODN so that it runs at certain times (e.g., every month).

 

odn-management-publication-seq.png

5.6.1. Wizard for preparing the transformation task

ODN/UnifiedViews provides a standard dialog for editing the data transformation pipeline. In addition, ODN/Management provides a wizard (for inexperienced users) to prepare the transformation task. The wizard should be implemented by ODN/Management, using the ODN/UnifiedViews HTTP REST Transformation API to interact with transformation pipelines.

5.6.2. Structure of ODN/Management and its context

odn-management-structure.png

5.6.3. Interaction with other modules

ODN/Management allows management of the whole data transformation and publication process. ODN/Management uses ODN/InternalCatalog to store metadata about datasets to be transformed. ODN/Management calls ODN/UnifiedViews (its HTTP REST Transformation API) to create and execute transformation pipelines or to get the status of a transformation execution. ODN/Management can instruct ODN/Publication to publish data transformed by ODN/UnifiedViews and stored in ODN/Storage (based on a request from the ODN administrator/data publisher); this publication may also involve publication to ODN/Catalog.
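
The orchestration described above can be sketched as a single function that calls the other modules in order. Everything named here is hypothetical: the interfaces, method names and the "FINISHED" status value are assumptions made for illustration and do not describe the real HTTP REST Transformation API or Publication API.

```python
from typing import Protocol


class Transformation(Protocol):
    def execute_pipeline(self, pipeline_id: str) -> str: ...  # returns final status


class Publication(Protocol):
    def publish(self, dataset_id: str) -> None: ...


class Catalog(Protocol):
    def upsert_record(self, dataset_id: str) -> None: ...


def transform_and_publish(dataset_id: str, pipeline_id: str,
                          uv: Transformation, publication: Publication,
                          catalog: Catalog, publish_to_catalog: bool = True) -> bool:
    """Run the pipeline, then publish the result; return True on success."""
    status = uv.execute_pipeline(pipeline_id)
    if status != "FINISHED":  # hypothetical status value
        return False  # nothing is published when the transformation fails
    publication.publish(dataset_id)
    if publish_to_catalog:
        catalog.upsert_record(dataset_id)
    return True


class FakeModules:
    """In-memory stand-in implementing all three hypothetical interfaces."""

    def __init__(self, status: str) -> None:
        self.status = status
        self.calls = []

    def execute_pipeline(self, pipeline_id: str) -> str:
        self.calls.append(("execute", pipeline_id))
        return self.status

    def publish(self, dataset_id: str) -> None:
        self.calls.append(("publish", dataset_id))

    def upsert_record(self, dataset_id: str) -> None:
        self.calls.append(("catalog", dataset_id))


# Usage: one fake object plays the role of all three modules.
modules = FakeModules("FINISHED")
transform_and_publish("dataset-1", "pipeline-1", modules, modules, modules)
print(modules.calls)
```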