This document describes the architecture and design of Open Data Node. As of now (June 2014) it mainly contains information "as planned". Later on, as development continues, it will be maintained to describe the "as is" status.
The document is maintained as a set of Wiki pages, one page per section. The single-page output is then constructed using the include macro.
When adjusting images, use ODN-arch-all.odg and reupload the adjusted sources.
1. Executive Summary
The purpose of this document is to define architecture and design principles of the system to fulfill requirements (for example as defined in COMSODE Deliverable 2.1 ‘User requirements for the publication platform from target organizations, including the map of typical environments’).
The document describes the architecture of a software solution for open data publication (Open Data Node), which was developed by the COMSODE project and also serves as a basis for methodologies for Open Data publication (which will also be developed by the COMSODE project). The architecture was developed based on identified user requirements (use cases).
The system is split into modules that communicate using interfaces. Modules and interfaces are described in more detail in the respective sections. For each module, the document describes its role and how it fits into the context of the overall system, including its interactions with other modules. Some modules are also described with alternatives that can be used to fulfill the functionality of the module.
Due to modularity of the system, it is possible to use the system in different deployment environments and also to use only some modules of the system and integrate with other modules and systems. Some deployment schemes are also described in the document.
Modules are licensed as Open Source, thus external contributors (essentially anyone) can improve the modules or extend the system. To facilitate that, a description of the development model and environment is also included in the document.
This architecture was the basis for the initial implementation work of the COMSODE project. However, a new component that better fulfills the functionality of a particular module may be identified in the future. In that case, we will update the module description in this document accordingly, including an explanation of the replacement.
2. Document context
2.1. Purpose of document
The research and development project COMSODE had as one of the main objectives within its 24 months of duration to create a publication platform called Open Data Node (ODN) that builds on results of previous research and development in the linked data field. Its mission was to bring results from research environment into real-world for people, SMEs and other organizations to use and reuse.
ODN is a foundation for a data integration platform based on Open Data which allows the reuse of data not only between public bodies and end-users but also among public bodies themselves: public bodies can exchange information by using the same infrastructure and tools as end-users which can decrease costs of exchanging the data and in most cases also enhance the quality and speed of the exchange.
This document represents the output of Task 2.2 - Architecture and design of the publication platform (ODN) of the COMSODE project and has been maintained since then to reflect changes.
The purpose of the document is to provide overview of the architecture of Open Data Node, its modules, dependencies between modules and main communication interfaces.
For particular modules you can find description of their functionality and also overview of design of the modules.
Finally, it also describes possible deployment environments where ODN can be installed.
2.2. Related Documents
COMSODE DOW, version date 2013-08-08, pages 7-8 and 42-43
COMSODE Deliverable 2.1: User requirements for the publication platform from target organizations, including the map of typical environments
3. Methodology used
The architecture of the system and design were created in an iterative way – from outlines in the OpenData.sk Wiki, through COMSODE (DoW, the meetings and discussions within consortium and with User Board, to Deliverable D2.3) to current form described in this document. We took into account also previous development of components of the system, as some of them existed before the COMSODE project started or evolved after COMSODE project ended.
The COMSODE project used an internal collaborative space (a wiki based on Atlassian Confluence technology) accessible to all members of the project team. A dedicated space was established in the internal wiki for collecting inputs to the architecture and design from consortium members and from members of the User Board. Input from possible users of the platform, obtained by means other than the wiki, was also included in the use cases. As part of the final COMSODE outputs, the ODN documentation was copied into this public Wiki hosted by the OpenData.sk community.
The ODN platform is planned to reuse, integrate and extend Open Source components, and the architecture is based on the current status of those components and also on which additional features can be included in those components.
It is expected that the document will be updated when significant design changes are required due to changes in existing components or when a new component that better fits the project needs becomes available and can replace the currently selected component.
3.2. Partner contributions
The architecture and design document was prepared under management of COMSODE consortium member - EEA Ltd. Tomas Knap and Peter Hanečák are architects of ODN and UnifiedViews and DSE CUNI provided main input to Use cases. Project coordinator UNIMIB, DSE CUNI and ADDSEN have reviewed the pre-final version of the document.
After end of COMSODE project, EEA is helping to maintain this documentation.
ODN performs extraction and transformation (conversion, cleansing, anonymization, linking etc.) of data provided by governments and municipalities. ODN stores the result of data transformations in its own storage. ODN publishes the results of data transformations in open and machine readable formats to data consumers (citizens, companies, governments).
Main actors dealing with the system
government, municipalities, and other subjects providing data
government, municipalities, non-profit organizations (NGOs), citizens (general public), companies (SMEs), application developers consuming the transformed data
administrators, analysts, data curators, etc. (partially) responsible for configuring ODN - usually employees of data publisher
IT staff responsible for installation, maintenance and (partially) configuration of ODN - usually employees of data publisher
ODN helps publishers deal with the complexity of source data and its transformation to open data, and delivers easy-to-use, high quality open data to the data consumers.
ODN helps data consumers get the data easily and efficiently in open, machine readable formats.
Input to the system is the data of data publishers, stored in heterogeneous environments, using a wide variety of formats and employing many different technologies to access and process that data.
Output of the system is published open data in various forms, as linked data or as tabular data. API access to the data is also included. Data consumers may be provided with:
RDF data as a result of export from the storage
RDF data as a result of SPARQL Query
CSV data as a result of export from the storage
REST API to access data in the storage
All forms of the data shall be available for data consumers under an open license. For more details about the formats of published open data, please see Section 4.3.
4.4. Features for data publishers
automated and repeatable data harvesting: extraction and transformation (conversion, cleansing, anonymization, etc.) of data, both:
initial harvesting of whole datasets (first import)
periodical harvesting of incremental updates
integration tools for extracting data from publisher’s internal systems (e.g. databases, data in files, information systems with API, etc.)
internal storage for data and metadata; the metadata format will be based on DCAT (http://www.w3.org/TR/vocab-dcat/)
data publishing in open and machine readable formats to the general public and businesses including automated efficient distribution of updated data and metadata (dataset replication)
integration with data catalogs (like CKAN) for automated publication and updating of dataset metadata
internal data catalog of datasets for maintenance of dataset metadata
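As an illustration of the DCAT-based metadata mentioned in the list above, the following minimal Python sketch renders one dataset record as Turtle. It is not the ODN metadata profile: the function name, the dataset URI, the title, the keywords and the download URL are all made-up example values, and literals are not escaped for brevity. Only the DCAT and Dublin Core terms themselves (dcat:Dataset, dcat:keyword, dcat:distribution, dcat:downloadURL, dct:title) come from the published vocabularies.

```python
# Minimal sketch: render a DCAT dataset description as Turtle.
# All URIs and values passed in below are illustrative assumptions.

def dcat_turtle(dataset_uri, title, keywords, download_url):
    """Return a small Turtle document for one dataset with one distribution."""
    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f'    dct:title "{title}" ;',
    ]
    # One dcat:keyword statement per keyword (no literal escaping in this sketch).
    lines += [f'    dcat:keyword "{keyword}" ;' for keyword in keywords]
    lines += [
        "    dcat:distribution [",
        "        a dcat:Distribution ;",
        f"        dcat:downloadURL <{download_url}>",
        "    ] .",
    ]
    return "\n".join(lines)


print(dcat_turtle(
    "http://example.org/dataset/contracts",
    "Public contracts",
    ["contracts", "procurement"],
    "http://example.org/dumps/contracts.csv",
))
```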
4.5. Features for data consumers
Features for data consumers are discussed separately for different types of data consumers.
4.5.1. Citizen, data analyst, etc.
user is typically accessing an ODN instance maintained by someone else; the user is not running his own instance
user may download data dumps and call APIs to get the data which he is interested in
changes to data dumps are advertised via Atom feeds
user may access the data indirectly, for example via 3rd party data catalog, which - in order to show the user preview or visualization of data - has to first download that data (in a similar manner as if user was accessing it: i.e. downloading a dump or accessing an API from ODN instance maintained by someone else)
4.5.2. Aggregator of Open Data (public body, NGO, SME, etc.)
the same as in Section 4.5.1
aggregator may easily replicate content in another Open Data Node
aggregator may automate data integration and linking
4.5.3. Application developer using Open Data (SME, NGO, etc., public body too)
the same as in Section 4.5.2
application developer has tools for preparing automated generation of API and custom API of the published datasets
4.5.4. Data administrator
possibility to set up the data extraction, transformation, and publication of open data
possibility to monitor execution of data extraction and transformation tasks
possibility to debug data extraction and transformation
4.6. Use Cases
In this deliverable, we depict use cases for data publishers and data consumers. Full list of use cases, also with related scenarios and mock-ups can be found at https://team.eea.sk/wiki/display/COMSODE/Use+Cases (note: This is the internal consortium wiki space mentioned in section “2.1. Methodology” - it is subject to change based on subsequent user requirements and will be most probably moved to public space once community around ODN grows).
For each use case below, we introduce its name, a short description and the ODN modules participating in the use case (ODN/M = ODN/Management module, ODN/UV = ODN/UnifiedViews, ODN/P = ODN/Publication module, ODN/IC = ODN/InternalCatalog module, ODN/C = ODN/PublicCatalog module). For full details, please see https://team.eea.sk/wiki/display/COMSODE/Use+Cases .
4.6.1. Use Cases for Data Publisher
Create dataset record
As a data publisher I want to create new record about the intended published data, so that I can define for every dataset information about the source data, intended transformations and ways how the transformed data should be published
Edit/Manage dataset record
As a data publisher I want to edit/manage dataset records
Delete dataset record
As a data publisher I want to delete outdated/obsolete dataset record
As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.)
Configure transformation using wizard
As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) using a wizard, so that it is really simple to prepare typical dataset transformation
As a data publisher I want to configure how the transformed dataset is published, thus, how the dataset may be consumed by data consumers
As a data publisher I want to publish the dataset
As a data publisher I want to transform the dataset (the dataset is transformed but not published yet)
Debug dataset transformation
As a data publisher I want to debug the dataset transformation, see intermediate results of the transformation, and see debug messages illustrating what happened during the dataset transformation
Configure creation of RDF dumps
As a data publisher I want to configure creation of RDF dumps from my published datasets
Configure creation of CSV dumps
As a data publisher I want to configure creation of CSV dumps from my published datasets
Configure publishing data via REST API
As a data publisher I want to configure how the REST API is generated on top of my published data, which data is accessible via REST API, which users may use the REST API, and which methods of accessing my data are available to data consumers
Configure publishing to SPARQL Endpoint
As a data publisher I want to configure how data consumers may connect to SPARQL endpoints with my published data
Schedule dataset publication
As a data publisher I want to automate the publication process, so that it can run every week or every time a new version of the dataset is available
Schedule dataset transformation
As a data publisher I want to automate the transformation part of the publication process, so that it can run every week or every time a new version of the dataset is available
Monitor data publishing tasks
As a data publisher I want to monitor the publishing task to see how the data transformation and data publishing were executed for my datasets
Basic overview about the transformation pipelines' execution
As a data publisher I want to monitor the publication of the dataset to see whether the publication was OK, or there were some errors
Detailed overview about the transformation pipelines' execution
As a data publisher I want to see the detailed overview about the transformations of the dataset
Browse transformation logs/events
As a data publisher I want to browse logs and events to see in detail what happened during the dataset transformation
Browse intermediate data
As a data publisher I want to browse the intermediate data produced as the dataset is being transformed
Overview about the publication of the transformed data
As a data publisher I want to be informed about the publication of the transformed dataset, whether there were some problems or not
Schedule publishing of transformed RDF data
As a data publisher I want to automate the publishing of the transformed datasets; typical requested behaviour: whenever the dataset is transformed, it should be also published
Publish transformed dataset
As a data publisher I want to publish the dataset, which has already been transformed
4.6.2. Use Cases for Data Consumers
Consume RDF data dump
As a data consumer I want to download RDF data dump, so that I can load it to my data store and work with it
Consume CSV data dump
As a data consumer I want to download CSV data dump, so that I can load it to my data store and work with it
Consume version of the data dump valid at certain time
As a data consumer I want to get the data dumps valid at certain time in the past
Query SPARQL Endpoint
As an advanced data consumer I want to query RDF data directly using SPARQL endpoint
Use REST API
As a data consumer I want to use REST API, so that I can work with the data from my app
Browse Data Catalog
As a data consumer I want to browse and search the list of datasets (data catalog)
Get metadata about dataset
As a data consumer I want to get metadata of the published dataset
Browse (sample) data
As a data consumer I want to browse data sample to get idea what is in the dataset
Data dump changes
As a data consumer I want to be notified (for example via RSS or Atom) when a data dump is updated or changed
4.7. Other inputs for Architecture decisions
This is additional list of inputs extending Deliverable 2.1: ‘User requirements for the publication platform from target organizations, including the map of typical environments’.
The system must be extensible with new DPUs (data processing units) in the data transformation (ETL) pipelines.
The overhead of managing data transformation (ETL) tasks and managing outputs from/preparing inputs to DPUs must be reasonable - any execution of an ETL task must not take more than 200% of the time needed for manual execution of the DPUs’ logic without the use of the ETL framework (under the condition that the pipeline is running alone).
The system must be able to process big files - CSV files containing millions of rows, RDF files containing hundreds of millions of triples.
Response time: The Web management GUI of ODN must respond in 99.9% of cases in less than 1s (all components of ODN on one machine, client connecting to the server using at least a 100Mb line).
Target platform: Linux, Windows
Preferred languages for the system implementation: Java, others only in case of reuse of existing components with sufficient added-value (for ODN and for ODN users)
The internal format for all data being transformed during the ETL process is RDF, a universal machine readable data format.
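The two measurable requirements in the list above (ETL overhead and GUI response time) can be expressed as simple checks. The following Python sketch is only one possible reading of those thresholds; the function names and the sample measurements are hypothetical and are not part of ODN.

```python
# Minimal sketch of the two measurable requirements; names are hypothetical.

def etl_overhead_ok(etl_seconds, manual_seconds, limit=2.0):
    """True if the ETL run took at most 200% of the manual execution time."""
    return etl_seconds <= limit * manual_seconds


def gui_response_ok(response_times, threshold=1.0, required_share=0.999):
    """True if at least 99.9% of measured responses took less than 1 second."""
    fast = sum(1 for seconds in response_times if seconds < threshold)
    return fast / len(response_times) >= required_share


# Made-up measurements, in seconds.
print(etl_overhead_ok(etl_seconds=150, manual_seconds=100))  # True
print(etl_overhead_ok(etl_seconds=250, manual_seconds=100))  # False
print(gui_response_ok([0.2] * 1999 + [1.5]))                 # True
print(gui_response_ok([0.2] * 99 + [1.5]))                   # False
```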
5. Open Data Node Modules
Open Data Node consists of the following modules:
Modules listed above are discussed in more detail in the following sections.
5.1. Module ODN/UnifiedViews
Module ODN/UnifiedViews is an ETL & data enrichment tool.
It is responsible for extracting and transforming source data (datasets), so that they can be published as (linked) open data. The result of the transformation is stored in the database managed by ODN/Storage module.
ODN/UnifiedViews module is responsible for:
extracting data provided by data publishers
transforming these data to machine readable data format; such transformation may include enriching the data, cleansing the data, assessing the quality of the data
storing the machine readable data to the database managed by ODN/Storage.
Input of the module is the data provided by data publishers. Data is expected to be structured, mostly tabular or linked data (RDF). The module will support basic data formats out of the box; support for more complex data formats is available via plugins.
The module will work with different formats (in files), but the preferred format is RDF. The RDF format allows the use of advanced data cleansing and enrichment techniques based on linked data even for use cases where the output will not be in RDF (for example, cases where ODN is used to clean CSV files before publishing).
Output of the module is the extracted and transformed machine readable data stored in ODN/Storage. Again, data is expected to be structured, tabular or linked data.
5.1.1. UnifiedViews - state of the art
Module ODN/UnifiedViews will use as its base the tool UnifiedViews (https://github.com/UnifiedViews). It is an ETL framework with a native support for transforming RDF data. UnifiedViews allows users to define, execute, monitor, debug, schedule, and share data transformation tasks.
UnifiedViews was originally developed as a student project at Charles University in Prague and now it is maintained by Semantica.cz, Czech Republic, Semantic Web Company, Austria, and EEA, Slovak Republic.
UnifiedViews allows users to define and adjust data processing tasks (pipelines) using a graphical user interface (see Figure below); the core components of every data processing task are data processing units (DPUs). DPUs may be drag&dropped on the canvas where the data processing task is constructed. Data flow between two DPUs is denoted as an edge on the canvas; a label on the edge clarifies which outputs of a DPU are mapped to which inputs of another DPU. UnifiedViews natively supports exchange of RDF data between DPUs; apart from that, files may be exchanged between DPUs.
UnifiedViews takes care of task scheduling. Users can plan executions of data processing tasks (e.g., tasks are executed at a certain time of the day) or they can start data processing tasks manually. UnifiedViews scheduler ensures that DPUs are executed in the proper order, so that all DPUs have proper required inputs when being launched.
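The ordering guarantee described above (every DPU runs only after the DPUs providing its inputs) corresponds to a topological ordering of the pipeline graph. The sketch below illustrates the idea with Python's standard graphlib module; it is not the UnifiedViews scheduler, and the DPU names and the pipeline itself are invented for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each DPU maps to the set of DPUs whose outputs it needs.
pipeline = {
    "extract_csv": set(),
    "convert_to_rdf": {"extract_csv"},
    "link_entities": {"convert_to_rdf"},
    "load_rdf_data_mart": {"link_entities"},
}


def execution_order(dependencies):
    """Return DPU names ordered so that each DPU follows all of its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(pipeline))
```

A cycle between DPUs would make such an ordering impossible; graphlib reports that case by raising CycleError.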
A user may configure UnifiedViews to get notifications about errors in the tasks' executions; user may also get daily summaries about the tasks executed.
To simplify the process of defining data processing tasks and to help users analyze errors during data processing task executions, UnifiedViews provides users with debugging capabilities. Users may browse and query (using the SPARQL query language) the RDF inputs to and RDF outputs from any DPU.
UnifiedViews framework also allows users to create custom plugins - data processing units (DPUs). Users can also share DPUs with others together with their configurations or use DPUs provided by others.
The technical structure and licensing of UnifiedViews allow DPUs to be licensed not just as Open Source, but also under a proprietary license. This is a planned feature of the tool, needed by use cases where commercial exploitation is required. ODN will support the same commercial use cases.
5.1.1.1. UnifiedViews components and dependencies
Figure below depicts the current Maven modules in UnifiedViews and their dependencies. Modules in the yellow box are visible to DPU developers. The most important modules are:
frontend - Management GUI of UnifiedViews
backend - Engine running the data transformation tasks
commons-app - DAO & Services module, which is common to frontend and backend modules; it is used to store configuration for pipelines, DPUs, pipeline executions etc.
dataunit-rdf, dataunit-file - Modules with interfaces for data units; DPU developers writing new DPUs use these modules to read data from input data units and write data to output data units
5.1.2. Structure of the ODN/UnifiedViews and its context
ODN/UnifiedViews comprises the following important components:
DAO & Service - used to access the database where the configuration of ETL tasks and their executions is stored (realized by module commons-app in Figure XX from chapter 5.1.1.1.)
HTTP REST Transformation API - Services from DAO & Services layer exposed as HTTP REST methods. Used by ODN/Management module (this component is not realized by any module in Figure XX)
Data Processing Engine - Robust engine running the manually launched or scheduled transformation tasks - transformations may include data cleansing, linking, integration, quality assessment (realized by “backend” module in Figure XX)
Management GUI - GUI used to manage the configuration of pipelines, debugging executions, etc. (realized by “frontend” module in Figure XX)
5.1.3. Interaction with other modules
1. ODN/UnifiedViews loads the transformed data to ODN/Storage. Special DPUs (RDF data mart loader and Tabular data mart loader) must be provided to load the transformed data into the corresponding data store in ODN/Storage. The data must be stored there together with metadata, so that the ODN/Publication module knows which resources (tables, graphs) are associated with which pipeline/dataset.
2. ODN/UnifiedViews will provide a RESTful management API, which will be used by ODN/Management to:
create new data transformation task (pipeline)
configure existing pipeline and get configuration of the pipeline
delete the pipeline
execute the pipeline
schedule the pipeline
An excerpt of the methods, which will be available to ODN/Management in a RESTful format is depicted below:
3. Management GUI of ODN/UnifiedViews is used by ODN/Management to:
show the pipeline detail in an expert mode (user may drag&drop DPUs, fine-tune pipeline configuration)
show the detailed results of pipeline executions (browse events/logs)
debug data being passed between DPUs
have an access to advanced scheduling options
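To make the management operations listed in item 2 above more concrete, the following Python sketch maps each operation to an HTTP method and URL in a conventional REST style. This is purely illustrative: the actual resource paths of the ODN/UnifiedViews management API are not specified in this document, so the base URL, the "/pipelines", "/executions" and "/schedules" paths and the function name are all assumptions.

```python
# Hypothetical sketch: the base URL and all resource paths are assumptions,
# chosen only to illustrate a conventional REST mapping of the operations.
BASE_URL = "http://localhost:8080/unifiedviews/api"


def pipeline_request(action, pipeline_id=None):
    """Map a management operation to an (HTTP method, URL) pair."""
    if action != "create" and pipeline_id is None:
        raise ValueError("pipeline_id is required for this action")
    collection = f"{BASE_URL}/pipelines"
    item = f"{collection}/{pipeline_id}"
    routes = {
        "create": ("POST", collection),
        "get_config": ("GET", item),
        "configure": ("PUT", item),
        "delete": ("DELETE", item),
        "execute": ("POST", f"{item}/executions"),
        "schedule": ("POST", f"{item}/schedules"),
    }
    return routes[action]


print(pipeline_request("create"))
print(pipeline_request("execute", pipeline_id=42))
```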
5.2. Module ODN/Storage
The purpose of this module is to store the transformed data produced by ODN/UnifiedViews. ODN/Publication module uses ODN/Storage to get the transformed data, so that it can be published - provided to data consumers.
5.2.1. Structure of the ODN/Storage and its context
Two important components of ODN/Storage are:
RDBMS data mart
RDF data mart
5.2.1.1. RDBMS data mart
RDBMS data mart is a tabular data store, where data is stored when data publisher wants to prepare CSV dumps of the published dataset or provide REST API for data consumers.
ODN/Storage will use SQL relational database (such as MySQL, PostgreSQL, etc.) for storing tabular data.
Every transformation pipeline can contain one or more Tabular data mart loaders - DPUs, which load data resulting from the transformation pipeline to RDBMS data mart. Every loader loads data into a single table. The name for the table is prepared by ODN/UnifiedViews and is based on the dataset ID and ID of the tabular data mart loader DPU.
Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by ODN/Publication module, data publisher may decide which tables should be published by (1) manually specifying all the tables which should be published or by (2) specifying that all results of a certain transformation pipeline should be published.
To support the above feature, data being stored to RDBMS data mart must be associated with metadata holding for every table at least:
which dataset the table belongs to
which transformation pipeline produced the table
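The naming and selection rules described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the concrete naming scheme ("ds_<dataset>_dpu_<loader>"), the TableMetadata record and the function names are hypothetical; the document only states that the table name is based on the dataset ID and the loader DPU ID, and that the metadata records the dataset and the producing pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Minimum metadata kept per table: owning dataset and producing pipeline."""
    table: str
    dataset_id: str
    pipeline_id: str


def table_name(dataset_id, loader_dpu_id):
    """Hypothetical naming scheme combining dataset ID and loader DPU ID."""
    return f"ds_{dataset_id}_dpu_{loader_dpu_id}"


def tables_to_publish(metadata, dataset_id, explicit_tables=None, pipeline_id=None):
    """Select tables (1) by an explicit list or (2) by the producing pipeline."""
    candidates = [m for m in metadata if m.dataset_id == dataset_id]
    if explicit_tables is not None:
        return [m.table for m in candidates if m.table in explicit_tables]
    return [m.table for m in candidates if m.pipeline_id == pipeline_id]


catalog = [
    TableMetadata(table_name("contracts", 1), "contracts", "p1"),
    TableMetadata(table_name("contracts", 2), "contracts", "p1"),
    TableMetadata(table_name("contracts", 3), "contracts", "p2"),
]
print(tables_to_publish(catalog, "contracts", pipeline_id="p1"))
print(tables_to_publish(catalog, "contracts", explicit_tables={"ds_contracts_dpu_3"}))
```

The same selection logic would apply to RDF graphs in the RDF data mart, with graph names in place of table names.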
Note: Currently, UnifiedViews supports Openlink Virtuoso (http://virtuoso.openlinksw.com/) as the only RDBMS implementation. As part of ODN, we will employ JDBC to add support for wider range of databases. Testing and validation will be done based on feedback from users (currently we plan to work also with PostgreSQL).
5.2.1.2. RDF data mart
Data is stored in RDF data mart when data publisher wants to prepare for data consumers RDF dumps of the published dataset or provide SPARQL endpoint on top of the published dataset.
Every transformation pipeline can contain one or more RDF data mart loaders - DPUs, which load data resulting from the transformation pipeline to RDF data mart. Every RDF data mart loader loads data to a single RDF graph. RDF graph represents a context for RDF triples, graph is a collection of RDF triples produced by one RDF data mart loader. The name for the RDF graph is prepared by ODN/UnifiedViews and is based on the dataset ID and ID of the RDF data mart loader DPU.
Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by ODN/Publication module, data publisher may decide which RDF graphs should be published by (1) manually specifying all the graphs which should be published or by (2) specifying that all results of a certain transformation pipeline should be published.
To support the above feature, data being stored to RDF data mart must be associated with metadata holding for every RDF data graph at least:
which dataset the graph belongs to
which transformation pipeline produced the graph
Note: Currently, UnifiedViews supports Openlink Virtuoso (http://virtuoso.openlinksw.com/) and Sesame (http://www.openrdf.org/) as RDF data mart implementations. As part of ODN, we will employ the SAIL API to add support for a wider range of triplestores. Testing and validation will be done based on feedback from users.
5.2.2. Interaction with other modules
1. Every transformation pipeline (ODN/UnifiedViews) can contain one or more RDF/RDBMS data mart loaders - DPUs, which load data resulting from the transformation pipeline to the corresponding data mart (RDF/RDBMS).
2. ODN/Storage notifies ODN/Publication about changes which happened (dataset updates, etc.) so that ODN/Publication can adapt to the changes.
3. ODN/Publication uses data marts to get the graphs/tables to be published (exported as RDF/CSV dumps, made available via REST API/SPARQL Endpoint). ODN/Publication selects the relevant graphs/tables based on the data publisher's preferences and the metadata associated with the tables/graphs.
4. ODN/Management may query ODN/Storage to get statistics about stored data, at least:
How many RDF graphs/tables are stored in the RDF/RDBMS data mart in total/for the given dataset ID?
How many RDF triples are stored in certain RDF graph in RDF data mart?
How many records are in certain table in RDBMS data mart?
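The statistics questions above translate into simple counting queries. The following Python sketch shows one possible form: a SPARQL query text for counting the triples of one graph (only built, not executed) and a record count against a relational table. SQLite is used purely as a stand-in for the RDBMS data mart, and the function names, table name and graph URI are example values, not part of ODN.

```python
import sqlite3


def triple_count_query(graph_uri):
    """Build SPARQL text counting the triples in one named graph (not executed here)."""
    return (
        "SELECT (COUNT(*) AS ?triples) "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )


def record_count(connection, table):
    """Count the records of one table in the (stand-in) RDBMS data mart."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return connection.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]


# SQLite in-memory database as an assumption-laden stand-in for the data mart.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "ds_contracts_dpu_1" (id INTEGER, supplier TEXT)')
conn.executemany(
    'INSERT INTO "ds_contracts_dpu_1" VALUES (?, ?)',
    [(1, "Acme"), (2, "Globex")],
)
print(record_count(conn, "ds_contracts_dpu_1"))  # 2
print(triple_count_query("http://example.org/graph/contracts"))
```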
5.3. Module ODN/Publication
Module responsible for publishing data via REST APIs, SPARQL endpoint or as data dumps in RDF or CSV formats. Published data is already transformed as defined by data transformation pipelines in ODN/UnifiedViews and stored in ODN/Storage.
The module allows data administrators/publishers to select how the published datasets are provided to data consumers; in particular, ODN/Publication module allows users to select:
publication of the dumps (CSV for tabular data, RDF for linked data),
publication via API (SPARQL Endpoint for RDF data, REST API for tabular data).
Data administrators/publishers may also configure some specific settings per each publication option: to tweak dump generation process (like which RDF serialization to use: Turtle, XML, etc.), to select which resources (tables, graphs) associated with the transformed dataset (and stored in ODN/Storage) should be published - made available to data consumers, etc.
5.3.1. Structure of the ODN/Publication and its context
ODN/Publication comprises the following important components:
DAO & Service layer - used to access database where configuration and results of publication tasks are stored
Publication Management API - called by ODN/Management when a certain dataset should be published or when certain methods of data consumption (REST API, SPARQL Endpoint, dumps) should be enabled or disabled
Publication Engine - module, which is responsible for:
creating dumps for the given dataset
configuring SPARQL endpoint/REST API for the given dataset
Management GUI - GUI used to manage the configuration of the ODN/Publication module
Note: As part of data publication, some metadata will be published by this module too (for example “Last Modification Time” will be included in the appropriate HTTP header in the response). But publication of metadata is mainly the responsibility of ODN/PublicCatalog (see section 4.5).
5.3.2. File dumps
ODN/Publication module supports creation of file dumps in CSV or RDF formats. Once a dataset is transformed, it is published. As part of the publishing of the transformed dataset, a CSV or RDF dump may be created. The dump in the CSV/RDF format is created only if the data publisher decides so.
To create the dump, ODN/Publication module exports the desired data from ODN/Storage. Afterwards, the dump is versioned using Git (http://git-scm.com/). Git allows data consumers to work with the latest or any previous version of the dataset. ODN/Publication also publishes metadata of the dump, which is obtained from ODN/InternalCatalog.
Finally, new entry in the Atom feed (http://en.wikipedia.org/wiki/Atom_(standard)) associated with the processed dataset is created; such feed points data consumers to the file(s) in the git repository, where the published data and metadata is. Such feed must be reachable from the dataset record in the ODN/PublicCatalog module.
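The feed entry described above could look roughly like the output of the following Python sketch, which builds a single Atom entry with the standard library. The element names (id, title, updated, link) come from the Atom format; the function name, the identifier, the timestamp and the dump URL are invented example values, and the choice of rel="enclosure" for the link to the dump file is an assumption.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def atom_entry(entry_id, title, updated, dump_url):
    """Build one Atom <entry> pointing to a newly published dump file."""
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # rel="enclosure" marks the link target as a downloadable resource (assumption).
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=dump_url, rel="enclosure")
    return ET.tostring(entry, encoding="unicode")


print(atom_entry(
    "urn:example:contracts:2014-06-30",
    "Dataset 'contracts' updated",
    "2014-06-30T12:00:00Z",
    "http://example.org/git/contracts/contracts.nt",
))
```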
5.3.2.1. RDF dumps
RDF dump may be published only if the result of the dataset transformation is available in RDF data mart in ODN/Storage.
To create the dump, ODN/Publication queries the RDF data mart via a SPARQL CONSTRUCT query to get the dump in the N-Triples (http://www.w3.org/TR/2014/REC-n-triples-20140225/) RDF serialization format. We use N-Triples because it is a line-oriented serialization format (one triple per line), which can be easily versioned by Git.
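The benefit of a line-oriented format for versioning can be seen in a small experiment: when each triple occupies one line, adding a triple changes exactly one line, so a line-based diff (which is what Git stores and displays) stays small. The Python sketch below demonstrates this with difflib; the triples are made-up examples, and sorting the lines before comparison is an assumption of this sketch, not something the document prescribes.

```python
import difflib

# Two versions of a tiny, made-up N-Triples dump: one triple per line.
old_dump = [
    '<http://example.org/c/1> <http://purl.org/dc/terms/title> "Contract 1" .',
    '<http://example.org/c/2> <http://purl.org/dc/terms/title> "Contract 2" .',
]
new_dump = old_dump + [
    '<http://example.org/c/3> <http://purl.org/dc/terms/title> "Contract 3" .',
]


def changed_lines(old, new):
    """Return only the added (+) and removed (-) triples between two dumps."""
    # Sorting keeps the line order stable between exports (assumption).
    diff = difflib.unified_diff(sorted(old), sorted(new), lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


print(changed_lines(old_dump, new_dump))
```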
5.3.2.2. CSV dumps
CSV dump may be published only if the result of the dataset transformation is available in RDBMS data mart in ODN/Storage.
To create the dump, ODN/Publication module exports the desired table in RDBMS data mart as CSV dump.
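Exporting a table as a CSV dump can be sketched as follows. This is a minimal Python illustration, not the ODN/Publication implementation: SQLite stands in for the RDBMS data mart, and the function name, table name and rows are example values.

```python
import csv
import io
import sqlite3


def export_table_as_csv(connection, table):
    """Dump one table as CSV text, with the column names as the header row."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = connection.execute(f"SELECT * FROM {quoted}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)  # the csv module quotes values containing commas
    return buffer.getvalue()


# SQLite in-memory database as a stand-in for the RDBMS data mart (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER, supplier TEXT)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex, Inc.")],
)
print(export_table_as_csv(conn, "contracts"))
```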
5.3.3. SPARQL endpoint
ODN/Publication module supports publication of data via SPARQL endpoints. Once a dataset is transformed, it is published. As part of the publishing of the transformed dataset, data may be made available via a SPARQL endpoint. This happens only if the data publisher decides so, and only if the result of the dataset transformation is available in the RDF data mart.
To make the data available via SPARQL endpoint, ODN/Publication module provides data consumers with a simple querying interface, where data consumer may query the published data and associated metadata (obtained from ODN/InternalCatalog) using SPARQL query. There is no versioning in this case, only latest data is available via SPARQL endpoint.
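From the data consumer's side, a query against such an endpoint is an ordinary HTTP request. The sketch below builds the request URL in the way defined by the SPARQL Protocol (the query text goes into the `query` parameter of a GET request); the endpoint address is a made-up example and the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_get_url(endpoint_url, query):
    """Build the URL of a SPARQL Protocol GET request.

    Assumes endpoint_url carries no query string of its own.
    """
    return f"{endpoint_url}?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Hypothetical endpoint address, for illustration only.
    url = build_sparql_get_url(
        "https://odn.example.org/sparql",
        "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10",
    )
    print(url)
    # The query survives the round trip through URL encoding.
    print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL can be opened with any HTTP client; the result format is typically selected with the Accept request header.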
5.3.4. REST API
ODN/Publication module supports creation of REST APIs for data consumption. A dataset is published once it has been transformed. As part of publishing the transformed dataset, REST API may be generated for the published data. REST API is generated only if the data publisher decides so, and only if the result of the dataset transformation is available in RDBMS data mart.
API is based on “Representational state transfer” software architectural style (https://en.wikipedia.org/wiki/Representational_State_Transfer) and - for the purpose of Open Data - will provide read-only functionality: Users will be able to get the data from datasets using HTTP protocol, getting results in JSON, XML, CSV or RDF formats based on their preference.
API is intended to be used by programmers or similarly skilled users who can develop software or scripts. But given the simple nature of this kind of API, even a casual user can work with it using a common web browser.
There is no versioning in this case, only latest data is available via REST API.
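The idea of returning the same read-only data in a format chosen by the user can be sketched as follows. Only JSON and CSV are shown to keep the example short; XML and RDF would be further branches. The function name and the record layout are assumptions, not the generated ODN API.

```python
import csv
import io
import json


def render_records(records, fmt):
    """Serialize a list of flat records (dicts) in the requested format.

    Sketch only: covers JSON and CSV, and assumes all records share the same keys.
    """
    if fmt == "json":
        return json.dumps(records, sort_keys=True)
    if fmt == "csv":
        buffer = io.StringIO(newline="")
        fieldnames = sorted(records[0]) if records else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")


if __name__ == "__main__":
    rows = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    print(render_records(rows, "json"))
    print(render_records(rows, "csv"))
```

Because the API is read-only, such a function needs no write path at all; it only maps stored records to a representation.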
4.3.6. Dataset replication
Automated, efficient distribution of updated data and metadata will be achieved by careful implementation of the two main methods mentioned earlier, i.e. file dumps and REST API, complemented with a third option based on Git.
The first two options are generic and interoperable: they will work regardless of the exact tool being used to replicate the data. At one end there will be ODN; at the other end there can be anything.
The third option is sort of proprietary: technically based on open formats and protocols, but limited to smaller/niche audiences.
Note: There is also a possibility for a fourth option based on a combination of file dumps and peer-to-peer technologies (like BitTorrent). As of now we do not register a demand for that, so it is not in the scope of the development.
4.3.6.1. Via file dumps
Proper publishing of file dumps, along with increments and Atom feeds, combined with proper usage of features of HTTP protocol (cache related headers, range requests, if-modified-since headers etc.) is one option.
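On the consumer side, the HTTP features mentioned above translate into request headers. The sketch below prepares (but does not send) a request that asks for a dump only if it changed since the last fetch, and optionally resumes a partial download. The helper name and parameters are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_conditional_request(dump_url, last_fetched=None, resume_from=None):
    """Prepare a GET request using If-Modified-Since and Range headers.

    last_fetched must be a timezone-aware datetime; resume_from is a byte offset.
    """
    request = Request(dump_url)
    if last_fetched is not None:
        # HTTP dates are expressed in GMT, e.g. "Sun, 01 Jun 2014 12:00:00 GMT".
        http_date = format_datetime(last_fetched.astimezone(timezone.utc), usegmt=True)
        request.add_header("If-Modified-Since", http_date)
    if resume_from is not None:
        # Ask only for the bytes that are still missing.
        request.add_header("Range", f"bytes={resume_from}-")
    return request


if __name__ == "__main__":
    req = build_conditional_request(
        "https://example.org/dumps/budget.csv",
        last_fetched=datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc),
        resume_from=1024,
    )
    print(req.header_items())
```

A server honouring these headers answers 304 Not Modified when nothing changed and 206 Partial Content for a range request, so an unchanged dump costs almost no traffic.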
4.3.6.2. Via REST API
REST API is another option, but that requires presence of “last modified” (or similar) fields within datasets at the line/record level.
Those two options are generic and interoperable: they will work regardless of the exact tool being used to replicate the data. At one end there will be Open Data Node; at the other end there can be anything.
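The requirement for a "last modified" field at record level becomes clear when the replication step is written down. In this sketch the field name `last_modified`, its ISO 8601 format and the function name are assumptions; records would in practice come from REST API responses.

```python
from datetime import datetime, timezone


def select_changed_records(records, since, field="last_modified"):
    """Return records modified after `since` and the new checkpoint.

    Assumes each record carries an ISO 8601 timestamp in `field`
    (a hypothetical field name) with an explicit UTC offset.
    """
    changed = [r for r in records if datetime.fromisoformat(r[field]) > since]
    # The newest timestamp seen becomes the starting point of the next run.
    checkpoint = max(
        (datetime.fromisoformat(r[field]) for r in changed),
        default=since,
    )
    return changed, checkpoint


if __name__ == "__main__":
    data = [
        {"id": 1, "last_modified": "2014-05-30T08:00:00+00:00"},
        {"id": 2, "last_modified": "2014-06-02T09:30:00+00:00"},
    ]
    print(select_changed_records(data, datetime(2014, 6, 1, tzinfo=timezone.utc)))
```

Without such a field, the replicating side would have to download and compare the whole dataset on every run.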
4.3.6.3. Via Git
Third option is to take advantage of Git versioning (see section “x.x.x. file dumps”):
‘git clone’ can be used to get a first copy of data
‘git pull’ can be used repeatedly to obtain subsequent updates
This method takes advantage of a lot of existing software and infrastructure, mainly Git versioning tool and for example GitHub (or GitHub like) repositories and is most suitable to software developers and subset of data analysts who already use such tools.
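The two Git commands above can be wrapped into one replication step. The sketch only decides which command to run; the function name is hypothetical, and `--ff-only` is an optional safety choice of this example, not something the document prescribes.

```python
import os


def git_replication_command(repo_url, target_dir):
    """Return the git command that brings target_dir up to date.

    First run: clone the repository. Later runs: pull the updates.
    """
    if os.path.isdir(os.path.join(target_dir, ".git")):
        # A local copy exists already, so only fetch and apply new commits.
        return ["git", "-C", target_dir, "pull", "--ff-only"]
    return ["git", "clone", repo_url, target_dir]


if __name__ == "__main__":
    # Hypothetical repository address and target directory.
    print(git_replication_command("https://example.org/datasets/budget.git", "budget-mirror"))
```

The returned list can be passed to `subprocess.run(command, check=True)`, for example from a scheduled job.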
5.3.5. Interaction with other modules
1. ODN/Management initiates any publication process via Publication API of ODN/Publication. ODN/Publication module uses ODN/Storage to get the data which should be published.
2. ODN/Management uses Management GUI of ODN/Publication to set up the settings for creation of CSV/RDF dumps, settings for generating REST APIs, settings for preparing SPARQL endpoint.
3. ODN/Publication reacts to notifications from ODN/Storage, for example by recreating file dumps or invalidating cached information for updated datasets.
4. Data consumers may (1) download CSV/RDF dumps, (2) use SPARQL endpoints, (3) use REST APIs.
5.4. Module ODN/InternalCatalog
Module ODN/InternalCatalog is the first module and main module with which data publishers interact while working with ODN. It encapsulates the functionality of data catalog but this functionality is used to manage datasets which should be transformed/published by ODN; it also allows data publishers to see details about the transformation/publishing process. It is an internal catalog, thus, it is not visible to public, but only data publisher/data administrator can use the catalog.
5.4.1. Interaction with other modules
Pipelines in ODN/UnifiedViews can create and update resources (metadata and data) in datasets maintained in ODN/InternalCatalog (based on associations between datasets and pipelines).
ODN/InternalCatalog automatically replicates data and metadata for datasets and resources into ODN/PublicCatalog, but (of course) only in cases when datasets are marked as "public" (i.e. suitable and intended for publication for data users).
5.5. Module ODN/PublicCatalog
ODN/PublicCatalog is the second module which encapsulates the functionality of data catalog. ODN/PublicCatalog holds metadata about each dataset, which is published by ODN. This data catalog is publicly visible, the primary users of this catalog are data consumers, who may browse/search the published datasets’ metadata; data consumer may also get a link to the dataset’s dump or API, so that they can consume the data in the dataset.
ODN/PublicCatalog also implements REST API for datasets where data publisher chose to do so. This is sort of "not nice" from design perspective (as it blurs the separation between ODN/Publication = data and ODN/PublicCatalog = metadata) but it is practical (it allows ODN to use REST API features implemented in CKAN, avoiding duplicate implementation).
5.5.1. Interaction with other modules
ODN/InternalCatalog replicates datasets marked as public into this module.
As part of publication functionality, ODN/PublicCatalog may link to ODN/Storage (specifically to Virtuoso based SPARQL endpoint GUI) or to ODN/Publication (links to file dumps served from ODN/Storage by Apache).
5.6. Module ODN/Management
Module is responsible for management of all components which form so called internal part of ODN (i.e. ODN/InternalCatalog and ODN/UnifiedViews, including also ODN/Management itself). As each of those components has its own web GUI which requires authentication, this module provides mainly:
- user management - via usage of midPoint (https://evolveum.com/midpoint/)
- Single Sign-On (SSO) - using CAS (https://www.apereo.org/)
TODO: Further details.
The main target development platform is Java. ODN/Publication may use PHP tool for generating APIs. Similarly, ODN/InternalCatalog and ODN/PublicCatalog use already existing Python based tool CKAN to ensure the data catalog functionality.
ODN must be able to run on both Linux and Windows operating systems.
6.1. Development process
Source codes related to ODN are kept on GitHub under GitHub project https://github.com/OpenDataNode. Specific repositories are created for particular modules.
- ODN/InternalCatalog for ODN specific extensions: https://github.com/OpenDataNode/InternalCatalog
- UnifiedViews plugins specific for ODN: https://github.com/OpenDataNode/UV-Plugins
Certain modules are developed under separate GitHub projects and are reused in ODN. Any changes to those modules are requested using issues in the particular project or as a pull request to the proper repository on GitHub.
- ODN/UnifiedViews https://github.com/UnifiedViews
- ODN/InternalCatalog and ODN/PublicCatalog core part and extensions: https://github.com/ckan
In exceptional case we can fork original repository (mainly in case when original repository doesn't accept pull request), but main goal will be to merge changes back later.
For the whole ODN, we will follow the guidelines for sustainable development proposed for developing ODN/UnifiedViews (https://grips.semantic-web.at/display/UDDOC/Guidelines+for+Contributors).
6.2. Used technologies
There are two main technological stacks in the ODN, derived from re-use of UnifiedViews and CKAN in the project:
- Java based stack - for ODN/UnifiedViews, ODN/Management, etc.
- Python based stack - for ODN/InternalCatalog and ODN/PublicCatalog
6.2.1. Java stack
Java based technologies are used for majority of ODN modules. Each module has its own sub-set of technologies, frameworks, libraries and tools. Some of those are mentioned here, but for more concrete and up-to-date information, please refer to homepages and documentation of upstream projects (TODO: add Wiki page listing all main upstream projects and link here).
Spring is an open source application framework and inversion of control container for Java platform. The core features of the Spring Framework can be used by any Java application, but there are extensions for building web applications on top of the Java EE (enterprise) platform. Spring will be used in all ODN modules implemented in Java.
22.214.171.124. Sesame openRDF
Sesame openRDF is an open source framework for processing RDF data. This includes parsing, storing, inferencing and querying of/over such data. It offers an easy-to-use API that can be connected to all leading RDF storage solutions. It allows connecting to SPARQL endpoints and creating applications that leverage the power of linked data and Semantic Web. OpenRDF is used in ODN/UnifiedViews, ODN/Storage and ODN/Publication to work with RDF database.
EclipseLink is the open source Eclipse Persistence Services Project. The software provides an extensible framework that allows Java developers to interact with various data services, including databases, web services, Object XML mapping (OXM), and Enterprise Information Systems (EIS). EclipseLink is used to persist data objects in ODN/UnifiedViews.
XStream is a simple library to serialize objects to XML and back. XStream uses reflection to discover the structure of the object graph to serialize at run time and does not require modifications to objects. It can serialize internal fields, including private and final, and supports non-public and inner classes. XStream is used in ODN/UnifiedViews to store pipeline configurations.
KineticJS lets you draw things onto the stage, add event listeners to them, move them, scale them, and rotate them independently from other shapes to support high performance animations, even if your application uses thousands of shapes.
126.96.36.199. OSGI framework
ODN/UnifiedViews must support easy and smooth extension with custom DPUs added into the running application. Every DPU may use its own set of libraries. These libraries must not be in conflict. To ensure that, we use OSGI framework, in particular the Apache Felix implementation (http://felix.apache.org/).
6.2.2. Python stack
TODO: Add some more details about Python itself (version supported, etc.) and frameworks used in CKAN, etc.
6.3. Development tools
Maven (http://maven.apache.org/) is used for management of dependencies in the source code, portability between Java IDEs and easy application build.
6.3.2. Git + GitFlow
Git (http://git-scm.com/) is a version control system used for source code version control management and tracking changes.
GitFlow (https://github.com/nvie/gitflow) is a collection of Git extensions to provide high-level repository operations for Vincent Driessen's branching model (http://nvie.com/git-model). Nice overview of the whole workflow is available at http://danielkummer.github.io/git-flow-cheatsheet/
SourceTree (http://www.sourcetreeapp.com/) is the recommended GUI client (available for MS Windows and Mac OS X) as it has good support for the defined gitflow workflow. On Linux, the command line has to be used to follow the gitflow workflow.
6.3.3. IDE for developers
Project is not strictly bound to a specific IDE, but Eclipse (http://www.eclipse.org/), NetBeans (https://netbeans.org/) or IntelliJ IDEA (http://www.jetbrains.com/idea/) are recommended IDEs for developing Java applications.
Eclipse and NetBeans also have plugins for development of other types of applications (for example php/Drupal based).
Firstly, we describe basic deployment scenarios of Open Data Node, considering Open Data Node as a single unit. Afterwards, we discuss deployment of ODN’s modules.
7.1. Basic ODN Deployment Scenarios
Open Data Node can be deployed many times by many actors. ODN can help with needs specific to each particular actor:
government organizations, municipalities, etc., want to publish majority of their information as Open Data
other government bodies need to work with some data published by other government bodies
non-profits and application developers want to run specific tasks using copies of official data, for example analytic and visualization applications, data integration, etc.
This section contains schemes for basic deployment options for ODN. Options are sorted based on achievable publishing quality from best (and most expensive) at the top to worst (and cheapest) at the bottom.
7.1.1. Tight integration, at the publisher's premises
Open Data Node is tightly integrated with publisher's internal application(s) - it has direct access to backend databases or is integrated into application(s) workflows via API. ODN is deployed alongside publisher's internal application(s), it has to respect network, security and other zones.
publisher wants to achieve high quality and efficiency and is willing to invest more
publisher is willing to update existing workflows and applications
7.1.2. Loose integration
Open Data Node is integrated with publisher's application(s) in a loose way using some periodical data dumps or APIs. ODN can be deployed in several locations:
at publisher's premises - access to the data is secured in a similar way as in the case of tight integration in Section 7.1.1.
at collocated housing, for example data center shared by multiple government organizations - access to the source data secured for example using combination of IPsec, HTTPS and access controls (authentication and authorization)
in the cloud - access to the source data secured with just HTTPS and access control (not suitable for sensitive data or sensitive internal systems)
publisher wants to achieve high quality and efficiency but has limited resources so changes to existing infrastructure and applications have to be limited
aggregator is affiliated with one or more publishers and is willing and able to invest into tighter integration with them (ministry aggregating data from municipalities or SME planning to make business using aggregated data)
publisher is able to do minor modifications to existing workflows and applications
7.1.3. No integration, deployed at 3rd parties
Open Data Node is not explicitly integrated with publisher's applications; other existing means are used to get access to the data (either Open Data or other format or API, in the worst case scraping of data from a website). ODN is deployed at 3rd party using their own hardware, collocated housing, or Cloud.
publisher wants to publish Open Data but has severely limited resources or options so changes to existing infrastructure and applications have to be preferably none
3rd party aggregator or application developer wants to use data from one or more publishers but for some reason is not able or willing to implement tighter integration with them
some usable form of access to source data is possible without changing existing workflows and applications
7.2. Deployment of ODN Modules
7.2.1. Single Machine
ODN modules may be deployed all on one machine as depicted below. The deployment requires:
Application Server, which supports Java 7 EE applications. ODN will be tested on Apache Tomcat 7+. Application server is needed for management GUIs of ODN modules.
Relational database management system (RDBMS) - Relational database is used mainly for storing configurations of the modules (definitions of data transformation and publishing tasks, configuration of data catalogs, etc.); in these cases, relational database is preferred, because the schema of the configurations is known in advance and should not change much in the future. Relational database is also used by ODN/Storage to store tabular data produced by ODN/UnifiedViews. Decision is still pending about the particular database management system we will use, however, due to object relational mapping frameworks, which abstract the underlying database system, changing the database system during the works on the project is easy and straightforward.
RDF Storage - We may use any RDF store, which is supported by openRDF API (http://www.openrdf.org/), e.g., Openlink Virtuoso (http://virtuoso.openlinksw.com/rdf-quad-store/) or Sesame (http://www.openrdf.org/). RDF store is used by ODN/UnifiedViews to store intermediate results of the data transformations and also by ODN/Storage to store the RDF data produced by ODN/UnifiedViews.
HTTP Server - ODN will be tested with Apache HTTP Server. HTTP Server is required for ODN/InternalCatalog, ODN/PublicCatalog and ODN/Publication.
7.2.2. Distributed Environment
ODN modules support distributed deployment across multiple physical devices. Basically every artifact depicted in the figure above may be deployed on a different device. If ODN/UnifiedViews - Engine and ODN/UnifiedViews - Management GUI are placed on different devices, administrator has to set up a shared network file system which both these artifacts can use.
Typically, we expect such deployment in large organizations with more complicated IT architecture driven by (among other things) security requirements. In this case:
ODN/UnifiedViews is expected to be deployed within internal, restricted segment along with ODN/Management,
ODN/Publication is expected to be in DMZ segment accessible from the outside by general public (like a typical webserver) and
ODN/Storage is expected to be in internal or other appropriate segment, reachable by both ODN/UnifiedViews and ODN/Publication
7.2.3. Custom Environment
As the design of ODN is modular (driven in particular by our vision, plans, engineering experience, etc., but in general also by best practises in software development and Open Source development) individual user will have possibility to not just move the individual modules between multiple machines, but also to skip (do not deploy) certain modules, if they do not need them.
For example, ODN/InternalCatalog and ODN/PublicCatalog may be skipped, in exchange for direct integration between ODN/UnifiedViews and ODN/Publication with the particular national data catalog.
Note: Given proper modular design and implementation, exact options and combinations are wide and depend strongly on particular user and his needs and use cases, so we are not able to (and will not) document all possible options here. Some most common cases will be later explained in ODN documentation and COMSODE Methodology.
Open Data Node as a whole is Free and Open Source software (see https://en.wikipedia.org/wiki/Free_and_open-source_software).
As it re-uses many existing free and open source components, it is not governed by one single license. Majority of components are covered by three basic families of licenses: GPL, APL and BSD. More detailed licensing information about individual ODN modules follows:
In this module, following components are used:
- OpenRDF Sesame, which is licensed under BSD-style License but is reusing also other components with separate licenses, see https://bitbucket.org/openrdf/sesame/src/master/core/LICENSE.txt and https://bitbucket.org/openrdf/sesame/src/master/core/NOTICE.txt
- PostgreSQL, licensed under its own Open Source license similar to BSD and MIT licenses (see https://wiki.postgresql.org/wiki/FAQ#What_is_the_license_of_PostgreSQL.3F)
- Virtuoso Open Source, licensed under GPLv2, for more information please take a look at http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSLicense
8.3 ODN/InternalCatalog and ODN/PublicCatalog
In this module, the principal components are:
- midPoint, which is licensed under APLv2 (see https://github.com/Evolveum/midpoint/blob/master/LICENSE)
- CAS, licensed under APLv2 (see https://www.apereo.org/content/cas-under-apache-20-license)
This module is primarily built on portion of Virtuoso Open Source (SPARQL endpoint and its GUI) - mentioned above in section "ODN/Storage" - and LDVMi.
LDVMi is licensed under APLv2 (see https://www.apereo.org/content/cas-under-apache-20-license).
8.5 Other modules and components
Each module can be broken down further into even smaller components and libraries. Each such smaller part has the same or other (but compatible) Open Source license.
For example, UnifiedViews uses following components and libraries:
Spring app. platform
Apache License 2.0
Apache License 2.0
Eclipse Public License (EPL)
MIT or GPL Version 2 licenses - https://github.com/ericdrowell/KineticJS/wiki/License
Apache Felix - OSGI framework
Apache License 2.0
As it is hard to maintain a complete component and licensing list down to the lowest level, we maintain only information about components directly utilized in ODN. Please consult documentation of each such major component (mentioned in previous sections) to obtain more detailed information about its sub-components and their licensing.
CSV - Comma separated values (http://en.wikipedia.org/wiki/Comma-separated_values)
DPU - Data Processing Unit, a component in UnifiedViews that can execute a transformation of data
DCAT - Data Catalog Vocabulary (http://www.w3.org/TR/vocab-dcat/)
ETL - extract, transform, and load; refers to a process that:
ODN - Open Data Node
OWL - Web Ontology Language (http://en.wikipedia.org/wiki/Web_Ontology_Language)
RDF - Resource Description Framework (http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/)
SPARQL - SPARQL Protocol and RDF Query Language (http://www.w3.org/TR/rdf-sparql-query/)
URI - Uniform resource identifier (http://en.wikipedia.org/wiki/URI)
UV - Unified Views (https://github.com/UnifiedViews)
Unified Views core components (https://github.com/UnifiedViews/Core)
Unified Views plugin components (DPUs) (https://github.com/UnifiedViews/Plugins)
Atom feed - a published list (or "feed") of recent articles or content in a standardized, machine readable format that can be downloaded by programs that use it (http://en.wikipedia.org/wiki/Atom_(standard)) . It is an alternative to RSS.
Data catalog - a database which stores metadata about datasets.
Data catalog vocabulary (DCAT) - an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web.
Data mart - access layer of the data warehouse environment that is used to get data out to the users (http://en.wikipedia.org/wiki/Data_mart). In ODN it is the database which is used by data catalogs to get dumps from and by ODN/Publication to provide REST API, generate data dumps and provide data as results of API calls.
Internal data mart - the database where the results of ODN/UnifiedViews are stored when the pipeline finishes. See also Data mart.
Data Processing Unit - see DPU
Dataset - one or more related input files (http://en.wikipedia.org/wiki/Data_set). In ODN, dataset is processed as input and also ODN produces datasets as output.
Dataset record - record about the dataset containing metadata of the dataset, associated publication pipelines and data catalog resources
Data catalog resource - the particular file (dump), REST API, or SPARQL Endpoint, visualization etc. associated with the given dataset record.
DCAT Distribution - Basic classes in DCAT are "Catalog", "Dataset" and "Distribution". Distribution is equal to catalog resource.
Data Publication Pipeline - directed acyclic graph, where nodes represent DPU instances, and directed edges represent data flow between these instances. Every pipeline consists of one or more DPUs. ODN/UnifiedViews supports exchange of RDF data between DPUs.
Data Publication Task - see Data Publication Pipeline
Data unit - an input to DPU or output from DPU is called data unit (DU). Every DPU may have more input data units and more output data units. DPUs may also use different types of DU, e.g., RDF data unit being able to read/write RDF data or File data unit being able to read/write generic files.
DPU - plugin on the data processing pipelines, which executes certain transformation, cleansing, quality assessment on top of the processed data. DPU encapsulates certain business logic needed when processing data (e.g., one DPU may extract data from a SPARQL endpoint or apply a SPARQL query). Every DPU must define its required/optional inputs and produced outputs.
DPU instance - placement of DPU on a pipeline. DPU instance is therefore created when DPU is placed on pipeline canvas.
DPU configuration - associative array of key-value pairs, which customize functionality of DPU instance.
External data catalog - for ODN it is data catalog that is not part of ODN but ODN can be integrated with it. see also Data catalog
Instance Configuration - configuration for specific DPU instance. It is created at the time of placing DPU on the pipeline (canvas) as a copy of Template Configuration
Internal Data catalog - Data catalog that is part of ODN and is not visible to public (only data publishers/data administrators can use it). In ODN is also referenced as Data catalog.
Public Data catalog - Data catalog that is part of ODN and is visible to public. In ODN is also referenced as Data catalog.
RDF - standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. (http://en.wikipedia.org/wiki/Resource_Description_Framework, http://www.w3.org/RDF/)
Resource Description Framework - see RDF
Representational state transfer - (REST) software architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within a distributed hypermedia system and also applied to the development of web services. (https://en.wikipedia.org/wiki/Representational_State_Transfer)
SPARQL - (SPARQL Query Language for RDF) query language for databases, able to retrieve and manipulate data stored in RDF format. SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. (http://en.wikipedia.org/wiki/SPARQL, http://www.w3.org/2001/sw/wiki/SPARQL)
Staging database - the database used by ODN/UnifiedViews to store intermediate results of the DPUs on the pipeline
Template Configuration - previously called "default configuration", default DPU configuration associated with each DPU.
Transformation Pipeline - see Data Transformation Pipeline
http://www.comsode.eu/ main page of project COMSODE
https://github.com/UnifiedViews - source codes of ODN/UnifiedViews module
http://ckan.org/ - data catalog CKAN home page
http://nucivic.com/dkan/ - open data platform for cataloging home page
http://www.w3.org/2001/sw/DataAccess/tests/implementations - comparison of different implementations of SPARQL endpoint
http://www.openrdf.org/ - homepage of openRDF Sesame framework for SPARQL
http://theodi.org/blog/git-data-publishing - ideas how to use GitHub for data publishing