Architecture and Design - Open Data Node Modules - Storage

5.2. Module ODN/Storage

The purpose of this module is to store the transformed data produced by ODN/UnifiedViews. ODN/Publication module uses ODN/Storage to get the transformed data, so that it can be published - provided to data consumers.

5.2.1. Structure of the ODN/Storage and its context

Two important components of ODN/Storage are:

RDBMS data mart
RDF data mart

5.2.1.1. RDBMS data mart

RDBMS data mart is a tabular data store, where data is stored when data publisher wants to prepare CSV dumps of the published dataset or provide REST API for data consumers.

ODN/Storage will use SQL relational database (such as MySQL, PostgreSQL, etc.) for storing tabular data.

Every transformation pipeline can contain one or more Tabular data mart loaders - DPUs, which load data resulting from the transformation pipeline to RDBMS data mart. Every loader loads data into a single table. The name for the table is prepared by ODN/UnifiedViews and is based on the dataset ID and ID of the tabular data mart loader DPU.

Since every published dataset may require more then one transformation pipeline, and not all results of every transformation pipeline should be published by ODN/Publication module, data publisher may decide which tables should be published by (1) manually specifying all the tables which should be published or by (2) specifying that all results of certain transformation pipeline should be published.

To support the above feature, data being stored to RDBMS data mart must be associated with metadata holding for every table at least:

to which dataset the table belongs to
which transformation pipeline produced the table

Note: Currently, UnifiedViews supports Openlink Virtuoso (http://virtuoso.openlinksw.com/) as the only RDBMS implementation. As part of ODN, we will employ JDBC to add support for wider range of databases. Testing and validation will be done based on feedback from users (currently we plan to work also with PostgreSQL).

5.2.1.2. RDF data mart

Data is stored in RDF data mart when data publisher wants to prepare for data consumers RDF dumps of the published dataset or provide SPARQL endpoint on top of the published dataset.

Every transformation pipeline can contain one or more RDF data mart loaders - DPUs, which load data resulting from the transformation pipeline to RDF data mart. Every RDF data mart loader loads data to a single RDF graph. RDF graph represents a context for RDF triples, graph is a collection of RDF triples produced by one RDF data mart loader. The name for the RDF graph is prepared by ODN/UnifiedViews and is based on the dataset ID and ID of the RDF data mart loader DPU.

Since every published dataset may require more then one transformation pipeline, and not all results of every transformation pipeline should be published by ODN/Publication module, data publisher may decide which RDF graphs should be published by (1) manually specifying all the graphs which should be published or by (2) specifying that results of certain transformation pipeline should be published.

To support the above feature, data being stored to RDF data mart must be associated with metadata holding for every RDF data graph at least:

to which dataset the graph belongs to
which transformation pipeline produced the graph

Note: Currently, UnifiedViews supports Openlink Virtuoso (http://virtuoso.openlinksw.com/) and Sesame (http://www.openrdf.org/) as RDF data mart implementation. As part of ODN, we will employ SAIL API to add support for wider range of triplestores. Testing and validation will be done based on feedback from users.

5.2.2. Interaction with other modules

1. Every transformation pipeline (ODN/UnifiedViews) can contain one or more RDF/RDBMS data mart loaders - DPUs, which load data resulting from the transformation pipeline to the corresponding data mart (RDF/RDBMS).

2. ODN/Storage notifies ODN/Publication about changes which happened (dataset updates, etc.) so that ODN/Publication can adapt to the changes.

3. ODN/Publication uses data marts to get required graphs/tables to be published (exported as RDF/CSV dumps, made available via REST API/SPARQL Endpoint). ODN/Publication selects the relevant graphs/tables based on the data publishers preference and metadata associated with tables/graphs.

3. ODN/Management may query ODN/Storage to get statistics about stored data, at least:

How many RDF graphs/tables is stored in RDF/RDBMS data mart in total/for the given dataset ID?
How many RDF triples are stored in certain RDF graph in RDF data mart?
How many records are in certain table in RDBMS data mart?

Page tree