5. Open Data Node Modules


Open Data Node consists of the following modules:

  • ODN/UnifiedViews

  • ODN/Storage

  • ODN/Publication

  • ODN/InternalCatalog

  • ODN/PublicCatalog

  • ODN/Management

The modules listed above are discussed in more detail in the following sections.

5.1. Module ODN/UnifiedViews

Module ODN/UnifiedViews is an ETL and data enrichment tool.

It is responsible for extracting and transforming source data (datasets), so that they can be published as (linked) open data. The result of the transformation is stored in the database managed by the ODN/Storage module.

ODN/UnifiedViews module is responsible for:

  1. extracting data provided by data publishers

  2. transforming these data to a machine-readable data format; such transformation may include enriching the data, cleansing the data, and assessing the quality of the data

  3. storing the machine-readable data in the database managed by ODN/Storage.

The input of the module is the data provided by data publishers. Data is expected to be structured, mostly tabular or linked data (RDF). The module will support basic data formats out of the box; support for more complex data formats is available via plugins.

The module will work with different formats (in files), but data in RDF format is preferred. The RDF format allows advanced data cleansing and enrichment techniques based on linked data to be used also in cases where the output will not be RDF (for example, cases where ODN is used to clean CSV files before publishing).

The output of the module is the extracted and transformed machine-readable data stored in ODN/Storage. Again, the data is expected to be structured, tabular or linked data.

5.1.1. UnifiedViews - state of the art

Module ODN/UnifiedViews will use the tool UnifiedViews (https://github.com/UnifiedViews) as its base. It is an ETL framework with native support for transforming RDF data. UnifiedViews allows users to define, execute, monitor, debug, schedule, and share data transformation tasks.

UnifiedViews was originally developed as a student project at Charles University in Prague and now it is maintained by Semantica.cz, Czech Republic, Semantic Web Company, Austria, and EEA, Slovak Republic.

UnifiedViews allows users to define and adjust data processing tasks (pipelines) using a graphical user interface (see Figure below); the core components of every data processing task are data processing units (DPUs). DPUs may be dragged and dropped onto the canvas where the data processing task is constructed. Data flow between two DPUs is denoted as an edge on the canvas; a label on the edge clarifies which outputs of a DPU are mapped to which inputs of another DPU. UnifiedViews natively supports exchange of RDF data between DPUs; apart from that, files may be exchanged between DPUs.

unifiedViews-ui.png

UnifiedViews takes care of task scheduling. Users can plan executions of data processing tasks (e.g., tasks are executed at a certain time of the day) or they can start data processing tasks manually. UnifiedViews scheduler ensures that DPUs are executed in the proper order, so that all DPUs have proper required inputs when being launched.

A user may configure UnifiedViews to get notifications about errors in task executions; the user may also get daily summaries about the executed tasks.

To simplify the process of defining data processing tasks and to help users analyze errors during data processing task executions, UnifiedViews provides debugging capabilities. Users may browse and query (using the SPARQL query language) the RDF inputs to and RDF outputs from any DPU.

The UnifiedViews framework also allows users to create custom plugins - data processing units (DPUs). Users can share DPUs, together with their configurations, with others, or use DPUs provided by others.

The technical structuring and licensing of UnifiedViews allows DPUs to be licensed not just as open source but also under a proprietary license. This is a planned feature of the tool, needed for use cases involving commercial exploitation. ODN will support the same commercial use cases.

5.1.1.1. UnifiedViews components and dependencies

The figure below depicts the current Maven modules in UnifiedViews and their dependencies. Modules in the yellow box are visible to DPU developers. The most important modules are:

  • frontend - Management GUI of UnifiedViews

  • backend - Engine running the data transformation tasks

  • commons-app - DAO & Services module, which is common to frontend and backend modules; it is used to store configuration for pipelines, DPUs, pipeline executions etc.

  • dataunit-rdf, dataunit-file - Modules with interfaces for data units; DPU developers writing new DPUs use these modules to read data from input data units and write data to output data units

uv-ComponentModel.png

 

5.1.2. Structure of the ODN/UnifiedViews and its context

odn-uv-structure.png

 

ODN/UnifiedViews comprises the following important components:

  • DAO & Services - used to access the database where the configuration of ETL tasks and their executions is stored (realized by the commons-app module in the figure in section 5.1.1.1)

  • HTTP REST Transformation API - services from the DAO & Services layer exposed as HTTP REST methods; used by the ODN/Management module (this component is not realized by any module in the figure in section 5.1.1.1)

  • Data Processing Engine - robust engine running the manually launched or scheduled transformation tasks; transformations may include data cleansing, linking, integration, and quality assessment (realized by the “backend” module in the figure in section 5.1.1.1)

  • Management GUI - GUI used to manage the configuration of pipelines, debug executions, etc. (realized by the “frontend” module in the figure in section 5.1.1.1)

 

5.1.3. Interaction with other modules

1. ODN/UnifiedViews loads the transformed data to ODN/Storage. Special DPUs - an RDF data mart loader and a Tabular data mart loader - must be provided to load the transformed data into the corresponding data store in ODN/Storage. The data must be stored there together with metadata, so that the ODN/Publication module knows which resources (tables, graphs) are associated with which pipeline/dataset.

2. ODN/UnifiedViews will provide a RESTful management API, which will be used by ODN/Management to:

  • create new data transformation task (pipeline)

  • configure existing pipeline and get configuration of the pipeline

  • delete the pipeline

  • execute the pipeline

  • schedule the pipeline

An excerpt of the methods which will be available to ODN/Management via the RESTful API is depicted below:

odn-uv-HTTP REST Transformation API.png
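
As an illustration of how ODN/Management could call such an API, the sketch below uses Java's built-in HTTP client. The base URL, endpoint paths and JSON payload are hypothetical placeholders, not the final ODN/UnifiedViews API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Illustrative sketch only: endpoints and payload are hypothetical.
    public class PipelineApiClientSketch {

        private static final String BASE = "http://localhost:8080/unifiedviews/api"; // placeholder base URL

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Create a new data transformation task (pipeline).
            HttpRequest create = HttpRequest.newBuilder(URI.create(BASE + "/pipelines"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"name\": \"my-dataset-pipeline\"}"))
                    .build();
            HttpResponse<String> created = client.send(create, HttpResponse.BodyHandlers.ofString());
            System.out.println("create -> " + created.statusCode() + " " + created.body());

            // Execute an existing pipeline (id 42 is a placeholder).
            HttpRequest execute = HttpRequest.newBuilder(URI.create(BASE + "/pipelines/42/executions"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            System.out.println("execute -> "
                    + client.send(execute, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }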

 

3. Management GUI of ODN/UnifiedViews is used by ODN/Management to:

  • show the pipeline detail in an expert mode (user may drag&drop DPUs, fine-tune pipeline configuration)

  • show the detailed results of pipeline executions (browse events/logs)

  • debug data being passed between DPUs

  • access advanced scheduling options

5.2. Module ODN/Storage

The purpose of this module is to store the transformed data produced by ODN/UnifiedViews. ODN/Publication module uses ODN/Storage to get the transformed data, so that it can be published - provided to data consumers.

 

5.2.1. Structure of the ODN/Storage and its context

odn-storage-structure.png

Two important components of ODN/Storage are:

  • RDBMS data mart

  • RDF data mart

5.2.1.1. RDBMS data mart

The RDBMS data mart is a tabular data store where data is stored when the data publisher wants to prepare CSV dumps of the published dataset or provide a REST API for data consumers.

ODN/Storage will use an SQL relational database (such as MySQL or PostgreSQL) for storing tabular data.

Every transformation pipeline can contain one or more Tabular data mart loaders - DPUs which load data resulting from the transformation pipeline into the RDBMS data mart. Every loader loads data into a single table. The name of the table is prepared by ODN/UnifiedViews and is based on the dataset ID and the ID of the tabular data mart loader DPU.
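
A minimal sketch of how such a table name could be derived is shown below; the concrete naming convention (prefixes, separators) is an assumption for illustration only.

    // Illustrative sketch only: the real naming convention used by ODN/UnifiedViews may differ.
    public class TableNameSketch {

        /** Derives a valid SQL identifier from the dataset ID and the loader DPU ID. */
        static String tableName(long datasetId, long loaderDpuId) {
            return "ds_" + datasetId + "_dpu_" + loaderDpuId;
        }

        public static void main(String[] args) {
            System.out.println(tableName(102, 7)); // prints: ds_102_dpu_7
        }
    }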

Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by the ODN/Publication module, the data publisher may decide which tables should be published by (1) manually specifying all the tables which should be published, or (2) specifying that all results of a certain transformation pipeline should be published.

To support the above feature, data being stored to the RDBMS data mart must be associated with metadata holding, for every table, at least the following (a minimal sketch of such metadata follows the list below):

  • to which dataset the table belongs

  • which transformation pipeline produced the table
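
The sketch below shows one possible shape of this metadata as a single table accessed via JDBC; the table name, columns and connection details are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Illustrative sketch only: names and credentials are placeholders.
    public class DataMartMetadataSketch {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/odn_storage", "odn", "secret");
                 Statement st = con.createStatement()) {

                // One metadata row per data mart table.
                st.execute("CREATE TABLE IF NOT EXISTS data_mart_metadata ("
                        + "  table_name  VARCHAR(128) PRIMARY KEY,"
                        + "  dataset_id  BIGINT NOT NULL,"     // to which dataset the table belongs
                        + "  pipeline_id BIGINT NOT NULL)");   // which pipeline produced the table

                // Recorded by the Tabular data mart loader DPU when it creates the table.
                st.execute("INSERT INTO data_mart_metadata (table_name, dataset_id, pipeline_id) "
                        + "VALUES ('ds_102_dpu_7', 102, 15)");
            }
        }
    }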

Note: Currently, UnifiedViews supports OpenLink Virtuoso (http://virtuoso.openlinksw.com/) as the only RDBMS implementation. As part of ODN, we will employ JDBC to add support for a wider range of databases. Testing and validation will be done based on feedback from users (currently we plan to work also with PostgreSQL).

5.2.1.2. RDF data mart

Data is stored in the RDF data mart when the data publisher wants to prepare RDF dumps of the published dataset for data consumers or to provide a SPARQL endpoint on top of the published dataset.

Every transformation pipeline can contain one or more RDF data mart loaders - DPUs which load data resulting from the transformation pipeline into the RDF data mart. Every RDF data mart loader loads data into a single RDF graph. An RDF graph represents a context for RDF triples; each graph is a collection of RDF triples produced by one RDF data mart loader. The name of the RDF graph is prepared by ODN/UnifiedViews and is based on the dataset ID and the ID of the RDF data mart loader DPU.

Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by the ODN/Publication module, the data publisher may decide which RDF graphs should be published by (1) manually specifying all the graphs which should be published, or (2) specifying that results of a certain transformation pipeline should be published.

To support the above feature, data being stored to the RDF data mart must be associated with metadata holding, for every RDF graph, at least:

  • to which dataset the graph belongs

  • which transformation pipeline produced the graph

Note: Currently, UnifiedViews supports OpenLink Virtuoso (http://virtuoso.openlinksw.com/) and Sesame (http://www.openrdf.org/) as RDF data mart implementations. As part of ODN, we will employ the SAIL API to add support for a wider range of triplestores. Testing and validation will be done based on feedback from users.

5.2.2. Interaction with other modules

1. Every transformation pipeline (ODN/UnifiedViews) can contain one or more RDF/RDBMS data mart loaders - DPUs, which load data resulting from the transformation pipeline to the corresponding data mart (RDF/RDBMS).

2. ODN/Storage notifies ODN/Publication about changes which happened (dataset updates, etc.) so that ODN/Publication can adapt to the changes.

3. ODN/Publication uses data marts to get required graphs/tables to be published (exported as RDF/CSV dumps, made available via REST API/SPARQL Endpoint). ODN/Publication selects the relevant graphs/tables based on the data publisher's preferences and the metadata associated with tables/graphs.

4. ODN/Management may query ODN/Storage to get statistics about stored data, at least the following (see the sketch after this list):

  • How many RDF graphs/tables are stored in the RDF/RDBMS data mart in total/for the given dataset ID?

  • How many RDF triples are stored in a certain RDF graph in the RDF data mart?

  • How many records are in a certain table in the RDBMS data mart?
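
The sketch below illustrates the kind of queries this could translate to, assuming the hypothetical metadata table from section 5.2.1.1 and a PostgreSQL-backed RDBMS data mart; all names and URLs are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Illustrative sketch only: table names, graph URI and the JDBC URL are placeholders.
    public class StorageStatisticsSketch {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/odn_storage", "odn", "secret");
                 Statement st = con.createStatement()) {

                // How many tables does the RDBMS data mart hold for dataset 102?
                try (ResultSet rs = st.executeQuery(
                        "SELECT COUNT(*) FROM data_mart_metadata WHERE dataset_id = 102")) {
                    rs.next();
                    System.out.println("tables for dataset 102: " + rs.getLong(1));
                }

                // How many records are in a certain data mart table?
                try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM ds_102_dpu_7")) {
                    rs.next();
                    System.out.println("records in ds_102_dpu_7: " + rs.getLong(1));
                }
            }

            // For the RDF data mart, an analogous SPARQL query counts the triples in one graph;
            // it would be sent to the data mart's SPARQL endpoint.
            String countTriples = "SELECT (COUNT(*) AS ?triples) "
                    + "WHERE { GRAPH <http://example.org/graph/ds_102_dpu_8> { ?s ?p ?o } }";
            System.out.println(countTriples);
        }
    }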

5.3. Module ODN/Publication

This module is responsible for publishing data via REST APIs, a SPARQL endpoint, or as data dumps in RDF or CSV formats. Published data has already been transformed as defined by data transformation pipelines in ODN/UnifiedViews and stored in ODN/Storage.

The module allows data administrators/publishers to select how the published datasets are provided to data consumers; in particular, the ODN/Publication module allows users to select:

  • publication of the dumps (CSV for tabular data, RDF for linked data),

  • publication via API (SPARQL Endpoint for RDF data, REST API for tabular data).

Data administrators/publishers may also configure specific settings for each publication option: to tweak the dump generation process (e.g., which RDF serialization to use: Turtle, XML, etc.), to select which resources (tables, graphs) associated with the transformed dataset (and stored in ODN/Storage) should be published, i.e., made available to data consumers, etc.

 

5.3.1. Structure of the ODN/Publication and its context

odn-publication-structure.png

ODN/Publication comprises the following important components:

  • DAO & Services layer - used to access the database where the configuration and results of publication tasks are stored

  • Publication Management API - called by ODN/Management when a certain dataset should be published or when certain methods of data consumption (REST API, SPARQL endpoint, dumps) should be enabled or disabled

  • Publication Engine - module responsible for:

    • creating dumps for the given dataset

    • configuring SPARQL endpoint/REST API for the given dataset

  • Management GUI - GUI used to manage the configuration of the ODN/Publication module

Note: As part of data publication, some metadata will be published by this module too (for example, “Last Modification Time” will be included in the appropriate HTTP header in the response). But publication of metadata is mainly the responsibility of ODN/PublicCatalog (see section 5.5).

5.3.2. File dumps

The ODN/Publication module supports creation of file dumps in CSV or RDF formats. When a dataset is transformed, it is published. As part of publishing the transformed dataset, a CSV or RDF dump may be created; the dump in the CSV/RDF format is created only if the data publisher decides so.

To create the dump, the ODN/Publication module exports the desired data from ODN/Storage. Afterwards, the dump is versioned using Git (http://git-scm.com/). Git allows data consumers to work with the latest or any previous version of the dataset. ODN/Publication also publishes metadata of the dump, which is obtained from ODN/InternalCatalog.
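
A minimal sketch of this versioning step, assuming the JGit library is used to drive Git from Java, could look as follows; the repository path, file name and commit message are illustrative only.

    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    import org.eclipse.jgit.api.Git;

    // Illustrative sketch only: paths and dump content are placeholders.
    public class DumpVersioningSketch {

        public static void main(String[] args) throws Exception {
            File repoDir = new File("/var/odn/dumps/dataset-102");
            repoDir.mkdirs();

            // Initialise the repository on first use; later runs reuse it.
            try (Git git = Git.init().setDirectory(repoDir).call()) {

                // Write (or overwrite) the freshly exported dump into the working tree.
                Files.write(Paths.get(repoDir.getPath(), "dataset-102.nt"),
                        "<http://example.org/s> <http://example.org/p> \"o\" .\n".getBytes("UTF-8"),
                        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);

                // Stage and commit the new version; consumers can later diff or check out any revision.
                git.add().addFilepattern("dataset-102.nt").call();
                git.commit().setMessage("Publish new version of dataset 102").call();
            }
        }
    }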

Finally, a new entry in the Atom feed (http://en.wikipedia.org/wiki/Atom_(standard)) associated with the processed dataset is created; the feed points data consumers to the file(s) in the Git repository where the published data and metadata are. The feed must be reachable from the dataset record in the ODN/PublicCatalog module.

5.3.2.1. RDF dumps

An RDF dump may be published only if the result of the dataset transformation is available in the RDF data mart in ODN/Storage.

To create the dump, ODN/Publication queries the RDF data mart via a SPARQL CONSTRUCT query to get the dump in the N-Triples (http://www.w3.org/TR/2014/REC-n-triples-20140225/) RDF serialization format. We use N-Triples as the RDF serialization format because it is a line-oriented format which can be easily versioned by Git.
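
A minimal sketch of this export, assuming the Sesame (openrdf) client library and an RDF data mart reachable as a SPARQL endpoint, could look as follows; the endpoint URL, graph name and output path are placeholders.

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sparql.SPARQLRepository;
    import org.openrdf.rio.ntriples.NTriplesWriter;

    // Illustrative sketch only: endpoint, graph and file names are placeholders.
    public class RdfDumpSketch {

        public static void main(String[] args) throws Exception {
            SPARQLRepository repo = new SPARQLRepository("http://localhost:8890/sparql");
            repo.initialize();

            String construct = "CONSTRUCT { ?s ?p ?o } "
                    + "WHERE { GRAPH <http://example.org/graph/ds_102_dpu_8> { ?s ?p ?o } }";

            RepositoryConnection con = repo.getConnection();
            try (OutputStream out = new FileOutputStream("dataset-102.nt")) {
                // Evaluate the CONSTRUCT query and stream the result through an N-Triples writer.
                con.prepareGraphQuery(QueryLanguage.SPARQL, construct).evaluate(new NTriplesWriter(out));
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }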

5.3.2.2. CSV dumps

A CSV dump may be published only if the result of the dataset transformation is available in the RDBMS data mart in ODN/Storage.

To create the dump, the ODN/Publication module exports the desired table in the RDBMS data mart as a CSV dump.
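
A minimal sketch of such an export using plain JDBC is shown below; the connection details and table name are placeholders, and a production implementation would also need proper CSV quoting and escaping.

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    // Illustrative sketch only: no quoting/escaping of values containing commas or newlines.
    public class CsvDumpSketch {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/odn_storage", "odn", "secret");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM ds_102_dpu_7");
                 PrintWriter out = new PrintWriter("dataset-102.csv", "UTF-8")) {

                ResultSetMetaData meta = rs.getMetaData();
                int columns = meta.getColumnCount();

                // Header row with the column names of the exported table.
                StringBuilder header = new StringBuilder();
                for (int i = 1; i <= columns; i++) {
                    if (i > 1) header.append(',');
                    header.append(meta.getColumnLabel(i));
                }
                out.println(header);

                // One CSV line per record.
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= columns; i++) {
                        if (i > 1) line.append(',');
                        line.append(rs.getString(i));
                    }
                    out.println(line);
                }
            }
        }
    }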

5.3.3. SPARQL endpoint

The ODN/Publication module supports publication of data via a SPARQL endpoint. When a dataset is transformed, it is published. As part of publishing the transformed dataset, the data may be made available via the SPARQL endpoint, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDF data mart.

To make the data available via the SPARQL endpoint, the ODN/Publication module provides data consumers with a simple querying interface where they may query the published data and associated metadata (obtained from ODN/InternalCatalog) using SPARQL queries. There is no versioning in this case; only the latest data is available via the SPARQL endpoint.

5.3.4. REST API

The ODN/Publication module supports creation of REST APIs for data consumption. When a dataset is transformed, it is published. As part of publishing the transformed dataset, a REST API may be generated for the published data, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDBMS data mart.

The API is based on the “Representational state transfer” software architectural style (https://en.wikipedia.org/wiki/Representational_State_Transfer) and - for the purpose of Open Data - will provide read-only functionality: users will be able to get the data from datasets using the HTTP protocol, receiving results in JSON, XML, CSV or RDF formats based on their preference.

The API is intended to be used by programmers or similarly skilled users who can develop software or scripts. But given the simple nature of this kind of API, even a casual user can work with it using a common web browser.
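
The sketch below shows how a data consumer could fetch the same records in different formats simply by changing the Accept header; the URL and resource layout are hypothetical, not the final ODN API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Illustrative sketch only: the endpoint is a placeholder.
    public class RestConsumerSketch {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            for (String format : new String[] {"application/json", "text/csv"}) {
                HttpRequest request = HttpRequest.newBuilder(
                                URI.create("http://odn.example.org/api/datasets/102/records?limit=10"))
                        .header("Accept", format)   // content negotiation selects JSON, XML, CSV or RDF
                        .GET()
                        .build();

                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(format + " -> HTTP " + response.statusCode());
            }
        }
    }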

There is no versioning in this case; only the latest data is available via the REST API.

5.3.5. Dataset replication

Automated, efficient distribution of updated data and metadata will be achieved by careful implementation of the two main methods mentioned earlier, i.e. file dumps and the REST API, complemented with a third option based on Git.

The first two options are generic and interoperable: they will work regardless of the exact tool being used to replicate the data. At one end there will be ODN; at the other end there can be anything.

The third option is more specific: it is technically based on open formats and protocols but limited to smaller/niche audiences.

Note: There is also the possibility of a fourth option based on a combination of file dumps and peer-to-peer technologies (like BitTorrent). As of now we do not see a demand for it, so it is not in the scope of the development.

5.3.5.1. Via file dumps

Proper publishing of file dumps, along with increments and Atom feeds, combined with proper usage of HTTP protocol features (cache-related headers, range requests, If-Modified-Since headers, etc.) is one option.
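
A minimal sketch of a replication client that relies on the If-Modified-Since header is shown below; the dump URL, timestamp and local file name are placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative sketch only: a real client would remember the timestamp of its last download.
    public class DumpReplicationSketch {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://odn.example.org/dumps/dataset-102.nt"))
                    .header("If-Modified-Since", "Mon, 02 Mar 2015 10:00:00 GMT")
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() == 304) {
                System.out.println("Local copy is still up to date; nothing downloaded.");
            } else if (response.statusCode() == 200) {
                // Only overwrite the local copy when the server actually sent a newer version.
                Files.writeString(Path.of("dataset-102.nt"), response.body());
                System.out.println("Downloaded new version of the dump.");
            }
        }
    }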

5.3.5.2. Via REST API

The REST API is another option, but it requires the presence of “last modified” (or similar) fields within datasets at the line/record level.


5.3.5.3. Via Git

The third option is to take advantage of Git versioning (see section 5.3.2. File dumps):

  • ‘git clone’ can be used to get a first copy of data

  • ‘git pull’ can be used repeatedly to obtain subsequent updates

This method takes advantage of a lot of existing software and infrastructure, mainly the Git versioning tool and, for example, GitHub (or GitHub-like) repositories, and is most suitable for software developers and the subset of data analysts who already use such tools.
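
For completeness, the same two steps driven from Java (assuming the JGit library) could look as follows; the repository URL and local path are hypothetical. The plain git command line shown in the list above is of course equally valid.

    import java.io.File;

    import org.eclipse.jgit.api.Git;

    // Illustrative sketch only: URL and local directory are placeholders.
    public class GitReplicationSketch {

        public static void main(String[] args) throws Exception {
            File localDir = new File("/var/odn-replica/dataset-102");

            if (!localDir.exists()) {
                // First run: "git clone" obtains the initial copy of the published data.
                try (Git git = Git.cloneRepository()
                        .setURI("http://odn.example.org/git/dataset-102.git")
                        .setDirectory(localDir)
                        .call()) {
                    System.out.println("Cloned dataset into " + localDir);
                }
            } else {
                // Subsequent runs: "git pull" fetches only the updates.
                try (Git git = Git.open(localDir)) {
                    git.pull().call();
                    System.out.println("Dataset updated.");
                }
            }
        }
    }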

5.3.6. Interaction with other modules

1. ODN/Management initiates any publication process via Publication API of ODN/Publication. ODN/Publication module uses ODN/Storage to get the data which should be published.

2. ODN/Management uses Management GUI of ODN/Publication to set up the settings for creation of CSV/RDF dumps, settings for generating REST APIs, settings for preparing SPARQL endpoint.

3. ODN/Publication reacts to notifications from ODN/Storage, for example by recreating file dumps or invalidating cached information for updated datasets.

4. Data consumers may (1) download CSV/RDF dumps, (2) use SPARQL endpoints, (3) use REST APIs.

5.4. Module ODN/InternalCatalog

Module ODN/InternalCatalog is the first and main module with which data publishers interact while working with ODN. It encapsulates the functionality of a data catalog, but this functionality is used to manage datasets which should be transformed/published by ODN; it also allows data publishers to see details about the transformation/publishing process. It is an internal catalog; thus, it is not visible to the public, and only the data publisher/data administrator can use it.

5.4.1. Interaction with other modules

Pipelines in ODN/UnifiedViews can create and update resources (metadata and data) in datasets maintained in ODN/InternalCatalog (based on associations between datasets and pipelines).

ODN/InternalCatalog automatically replicates data and metadata for datasets and resources into ODN/PublicCatalog, but (of course) only in cases when datasets are marked as "public" (i.e. suitable and intended for publication to data users).

5.5. Module ODN/PublicCatalog

ODN/PublicCatalog is the second module which encapsulates the functionality of a data catalog. ODN/PublicCatalog holds metadata about each dataset published by ODN. This data catalog is publicly visible; its primary users are data consumers, who may browse/search the published datasets' metadata. Data consumers may also get a link to a dataset's dump or API, so that they can consume the data in the dataset.

ODN/PublicCatalog also implements a REST API for datasets where the data publisher chose to do so. This is sort of "not nice" from a design perspective (as it blurs the separation between ODN/Publication = data and ODN/PublicCatalog = metadata), but it is practical (it allows ODN to use REST API features implemented in CKAN, avoiding a duplicate implementation).

5.5.1. Interaction with other modules

ODN/InternalCatalog replicates datasets marked as public into this module.

As part of the publication functionality, ODN/PublicCatalog may link to ODN/Storage (specifically to the Virtuoso-based SPARQL endpoint GUI) or to ODN/Publication (links to file dumps served from ODN/Storage by Apache).

 

5.6. Module ODN/Management

This module is responsible for the management of all components which form the so-called internal part of ODN (i.e. ODN/InternalCatalog and ODN/UnifiedViews, including ODN/Management itself). As each of these components has its own web GUI which requires authentication, this module mainly provides:

  1. user management - via usage of midPoint (https://evolveum.com/midpoint/)
  2. Single Sign-On (SSO) - using CAS (https://www.apereo.org/)

TODO: Further details.

 

