
5.3. Module ODN/Publication

The module responsible for publishing data via REST APIs, a SPARQL endpoint, or as data dumps in RDF or CSV formats. Published data has already been transformed, as defined by the data transformation pipelines in ODN/UnifiedViews, and stored in ODN/Storage.

The module allows data administrators/publishers to select how the published datasets are provided to data consumers; in particular, the ODN/Publication module allows users to select:

  • publication of the dumps (CSV for tabular data, RDF for linked data),

  • publication via API (SPARQL Endpoint for RDF data, REST API for tabular data).

Data administrators/publishers may also configure specific settings for each publication option: to tweak the dump generation process (e.g. which RDF serialization to use: Turtle, RDF/XML, etc.), to select which resources (tables, graphs) associated with the transformed dataset (and stored in ODN/Storage) should be published, i.e. made available to data consumers, etc.
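
For illustration, per-dataset publication settings could be represented as in the following sketch (the field names are hypothetical and do not reflect the actual ODN configuration schema):

    # Hypothetical sketch of per-dataset publication settings; the actual
    # ODN/Publication configuration schema may differ.
    publication_settings = {
        "dataset": "budget-2014",
        "dumps": {
            "enabled": True,
            "formats": ["csv", "rdf"],
            "rdf_serialization": "ntriples",  # or "turtle", "rdfxml", ...
        },
        "sparql_endpoint": {"enabled": True},
        "rest_api": {"enabled": True, "tables": ["budget_items"]},
    }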

 

5.3.1. Structure of the ODN/Publication and its context

[Figure: odn-publication-structure.png - structure of the ODN/Publication module and its context]

ODN/Publication comprises the following important components:

  • DAO & Service layer - used to access database where configuration and results of publication tasks are stored

  • Publication Management API - called by ODN/Management when a certain dataset should be published or when certain methods of data consumption (REST API, SPARQL endpoint, dumps) should be enabled or disabled

  • Publication Engine - the module responsible for:

    • creating dumps for the given dataset

    • configuring SPARQL endpoint/REST API for the given dataset

  • Management GUI - GUI used to manage the configuration of the ODN/Publication module

Note: As part of data publication, some metadata will be published by this module too (for example, “Last Modification Time” will be included in the appropriate HTTP header of the response). But publication of metadata is mainly the responsibility of ODN/Catalog (see section 4.5).

5.3.2. File dumps

The ODN/Publication module supports creation of file dumps in CSV or RDF format. When a dataset is transformed, it is published; as part of publishing the transformed dataset, a CSV or RDF dump may be created, but only if the data publisher decides so.

To create the dump, the ODN/Publication module exports the desired data from ODN/Storage. Afterwards, the dump is versioned using Git (http://git-scm.com/). Git allows data consumers to work with the latest or any previous version of the dataset. ODN/Publication also publishes metadata of the dump, which is obtained from ODN/InternalCatalog.

Finally, a new entry in the Atom feed (http://en.wikipedia.org/wiki/Atom_(standard)) associated with the processed dataset is created; the feed points data consumers to the file(s) in the Git repository where the published data and metadata reside. The feed must be reachable from the dataset record in the ODN/Catalog module.
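
A minimal sketch of the versioning step, assuming the dump file has already been exported into a local Git working copy (the helper name and the exact Publication Engine internals are assumptions, not the actual implementation):

    import subprocess

    def publish_dump(dump_file, repo_dir, dataset_id):
        # The dump is assumed to have been exported from ODN/Storage into
        # the Git working copy at repo_dir already.
        subprocess.run(["git", "add", dump_file], cwd=repo_dir, check=True)
        # Committing creates the new version; this fails if nothing changed.
        subprocess.run(["git", "commit", "-m", "Update dump of " + dataset_id],
                       cwd=repo_dir, check=True)
        # Next steps (not shown): push the commit and append an Atom feed
        # entry pointing to the committed file(s).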

5.3.2.1. RDF dumps

An RDF dump may be published only if the result of the dataset transformation is available in the RDF data mart in ODN/Storage.

To create the dump, ODN/Publication queries the RDF data mart with a SPARQL CONSTRUCT query to get the dump in the N-Triples (http://www.w3.org/TR/2014/REC-n-triples-20140225/) RDF serialization format. N-Triples is used because it is a line-oriented serialization format which can be easily versioned by Git.
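
A sketch of such an export over the standard SPARQL protocol (the endpoint URL and graph URI are placeholders; the actual RDF data mart location is internal to ODN/Storage):

    import requests

    ENDPOINT = "http://localhost:8890/sparql"         # placeholder
    GRAPH = "http://example.org/dataset/transformed"  # placeholder

    query = "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <%s> { ?s ?p ?o } }" % GRAPH
    resp = requests.post(ENDPOINT, data={"query": query},
                         headers={"Accept": "application/n-triples"})
    resp.raise_for_status()
    with open("dataset.nt", "wb") as f:
        f.write(resp.content)  # line-oriented N-Triples, easy to diff in Git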

5.3.2.2. CSV dumps

A CSV dump may be published only if the result of the dataset transformation is available in the RDBMS data mart in ODN/Storage.

To create the dump, the ODN/Publication module exports the desired table in the RDBMS data mart as a CSV dump.
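
For example, with a PostgreSQL-based data mart (an assumption for this sketch) the export could use the COPY command; the connection string and table name are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=odn_datamart user=odn")  # placeholder DSN
    with conn.cursor() as cur, open("dataset.csv", "w") as f:
        # COPY streams the whole table out as CSV, including a header row.
        cur.copy_expert("COPY budget_items TO STDOUT WITH CSV HEADER", f)
    conn.close()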

5.3.3. SPARQL endpoint

The ODN/Publication module supports publication of data via a SPARQL endpoint. When a dataset is transformed, it is published; as part of publishing the transformed dataset, the data may be made available via the SPARQL endpoint, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDF data mart.

To make the data available via the SPARQL endpoint, the ODN/Publication module provides data consumers with a simple querying interface, where they may query the published data and associated metadata (obtained from ODN/InternalCatalog) using SPARQL queries. There is no versioning in this case; only the latest data is available via the SPARQL endpoint.
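
For example, a data consumer could query the published endpoint like this (the endpoint URL is a placeholder):

    import requests

    ENDPOINT = "http://odn.example.org/sparql"  # placeholder public endpoint

    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    for b in resp.json()["results"]["bindings"]:
        print(b["s"]["value"], b["p"]["value"], b["o"]["value"])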

5.3.4. REST API

The ODN/Publication module supports creation of REST APIs for data consumption. When a dataset is transformed, it is published; as part of publishing the transformed dataset, a REST API may be generated for the published data, but only if the data publisher decides so and only if the result of the dataset transformation is available in the RDBMS data mart.

The API is based on the “Representational state transfer” software architectural style (https://en.wikipedia.org/wiki/Representational_State_Transfer) and - for the purposes of Open Data - will provide read-only functionality: users will be able to get the data from datasets using the HTTP protocol, receiving results in JSON, XML, CSV or RDF format based on their preference.

The API is intended to be used by programmers or similarly skilled users who can develop software or scripts. But given the simple nature of this kind of API, even a casual user can work with it using a common web browser.

There is no versioning in this case; only the latest data is available via the REST API.
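
For illustration, consuming such a generated API could look like the following sketch (the URL pattern is hypothetical; the actual scheme is defined by ODN/Publication when the API is generated):

    import requests

    URL = "http://odn.example.org/api/datasets/budget-2014/records"  # placeholder

    # Format preference expressed via the Accept header; XML, CSV or RDF
    # could be requested analogously.
    resp = requests.get(URL, headers={"Accept": "application/json"})
    resp.raise_for_status()
    for record in resp.json():
        print(record)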

5.3.5. Dataset replication

Automated, efficient distribution of updated data and metadata will be achieved by careful implementation of the two main methods mentioned earlier, i.e. file dumps and REST API, complemented with a third option based on Git.

The first two options are generic and interoperable: they will work regardless of the exact tool being used to replicate the data. At one end there will be ODN; at the other end there can be anything.

The third option is somewhat proprietary: technically based on open formats and protocols, but limited to smaller/niche audiences.

Note: There is also a possibility of a fourth option based on a combination of file dumps and peer-to-peer technologies (like BitTorrent). As of now we do not register a demand for it, so it is not in the scope of the development.

5.3.5.1. Via file dumps

Proper publishing of file dumps, along with increments and Atom feeds, combined with proper usage of the features of the HTTP protocol (cache-related headers, range requests, If-Modified-Since headers, etc.) is one option.
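
A sketch of a replication client relying on If-Modified-Since (the dump URL is a placeholder):

    import requests

    DUMP_URL = "http://odn.example.org/dumps/budget-2014.csv"  # placeholder
    last_seen = "Tue, 01 Apr 2014 10:00:00 GMT"  # remembered from last run

    resp = requests.get(DUMP_URL, headers={"If-Modified-Since": last_seen})
    if resp.status_code == 304:
        print("Dump unchanged, nothing to download.")
    else:
        with open("budget-2014.csv", "wb") as f:
            f.write(resp.content)
        last_seen = resp.headers.get("Last-Modified", last_seen)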

5.3.5.2. Via REST API

The REST API is another option, but it requires the presence of a “last modified” (or similar) field within datasets at the line/record level.
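
Assuming records carry such a field, incremental replication could look like this sketch (the URL and the filter parameter name are hypothetical):

    import requests

    URL = "http://odn.example.org/api/datasets/budget-2014/records"  # placeholder
    since = "2014-04-01T00:00:00Z"  # timestamp of the previous sync

    resp = requests.get(URL, params={"modified_since": since})
    resp.raise_for_status()
    for record in resp.json():
        print("changed record:", record)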


5.3.5.3. Via Git

The third option is to take advantage of Git versioning (see section 5.3.2. File dumps):

  • ‘git clone’ can be used to get the first copy of the data,

  • ‘git pull’ can be used repeatedly to obtain subsequent updates.

This method takes advantage of a lot of existing software and infrastructure - mainly the Git versioning tool and, for example, GitHub (or GitHub-like) repositories - and is most suitable for software developers and the subset of data analysts who already use such tools.
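
A minimal automation sketch of this method (the repository URL is a placeholder):

    import os
    import subprocess

    REPO_URL = "https://git.example.org/odn/budget-2014-dumps.git"  # placeholder
    LOCAL_DIR = "budget-2014-dumps"

    if not os.path.isdir(LOCAL_DIR):
        # First run: 'git clone' fetches a full copy of the published data.
        subprocess.run(["git", "clone", REPO_URL, LOCAL_DIR], check=True)
    else:
        # Subsequent runs: 'git pull' fetches only the increments.
        subprocess.run(["git", "pull"], cwd=LOCAL_DIR, check=True)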

5.3.6. Interaction with other modules

1. ODN/Management initiates any publication process via the Publication API of ODN/Publication. The ODN/Publication module uses ODN/Storage to get the data which should be published.

2. ODN/Management uses the Management GUI of ODN/Publication to set up the settings for creating CSV/RDF dumps, generating REST APIs, and preparing the SPARQL endpoint.

3. ODN/Publication reacts to notifications from ODN/Storage by, for example, recreating file dumps or invalidating cached information for updated datasets.

4. Data consumers may (1) download CSV/RDF dumps, (2) use SPARQL endpoints, (3) use REST APIs.
