Open Data Node (ODN) is a publication platform which provides governments, municipalities, and other entities (e.g. companies) with an easy way of publishing their data as machine-readable (linked) open data.
ODN performs extraction and transformation (conversion, cleansing, anonymization, linking, etc.) of data provided by governments and municipalities. It stores the results of these transformations in its own storage and publishes them in open, machine-readable formats to data consumers (citizens, companies, governments).
4.1. Actors
The main actors interacting with the system are:
Data publishers: governments, municipalities, and other entities providing data
Data consumers: governments, municipalities, non-profit organizations (NGOs), citizens (general public), companies (SMEs), and application developers consuming the transformed data
Data administrators: administrators, analysts, data curators, etc., (partially) responsible for configuring ODN - usually employees of the data publisher
Administrators: IT staff responsible for installation, maintenance, and (partially) configuration of ODN - usually employees of the data publisher
ODN helps data publishers manage the complexity of source data and its transformation to open data, and deliver easy-to-use, high-quality open data to data consumers.
ODN helps data consumers get the data easily and efficiently in open, machine-readable formats.
4.2. Inputs
The input to the system is the data of data publishers, stored in heterogeneous environments, using a wide variety of formats, and requiring many different technologies to access and process.
4.3. Outputs
The output of the system is published open data in various forms, for example as linked data or as tabular data; API access to the data is also included. A data consumer may be provided with:
RDF data as a result of an export from the storage
RDF data as a result of a SPARQL query
CSV data as a result of an export from the storage
a REST API to access data in the storage
All forms of the data shall be available to data consumers under an open license.
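As an illustration of these consumer-facing outputs, here is a minimal sketch of fetching a CSV dump over HTTP with the JDK's built-in client. The endpoint URL and dataset name are hypothetical; the actual URL scheme is defined by the ODN configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a data consumer fetching a CSV dump over the REST API.
// The endpoint URL and dataset name are hypothetical.
public class OdnDumpClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://odn.example.org/api/datasets/budget-2014/dump"))
                .header("Accept", "text/csv")   // ask for the CSV form of the dataset
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```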
4.4. Features for data publishers
automated and repeatable data harvesting: extraction and transformation (conversion, cleansing, anonymization, etc.) of data, both:
initial harvesting of whole datasets (first import)
periodical harvesting of incremental updates
integration tools for extracting data from publisher’s internal systems (e.g. databases, data in files, information systems with API, etc.)
internal storage for data and metadata; the metadata format will be based on DCAT (http://www.w3.org/TR/vocab-dcat/) - a metadata sketch follows this list
data publishing in open and machine readable formats to the general public and businesses including automated efficient distribution of updated data and metadata (dataset replication)
integration with data catalogs (like CKAN) for automated publication and updating of dataset metadata
internal data catalog of datasets for maintenance of dataset metadata
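To illustrate the DCAT-based metadata mentioned above, the following is a minimal sketch, assuming Apache Jena as the RDF library, of building a dataset record. The dataset URI and property values are invented for illustration and do not prescribe ODN's actual metadata schema, which is only said to be "based on DCAT".

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

// Minimal sketch of a DCAT-style dataset record built with Apache Jena.
// All URIs and values are hypothetical.
public class DcatRecordSketch {
    static final String DCAT_NS = "http://www.w3.org/ns/dcat#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property keyword = model.createProperty(DCAT_NS, "keyword");

        model.createResource("http://odn.example.org/dataset/budget-2014")
                .addProperty(RDF.type, model.createResource(DCAT_NS + "Dataset"))
                .addProperty(DCTerms.title, "Municipal budget 2014")
                .addProperty(DCTerms.publisher,
                        model.createResource("http://odn.example.org/org/municipality"))
                .addProperty(keyword, "budget")
                .addProperty(keyword, "finance");

        model.write(System.out, "TURTLE");  // serialize the record for inspection
    }
}
```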
4.5. Features for data consumers
Features for data consumers are discussed separately for different types of data consumers.
4.5.1. Citizen, data analyst, etc.
the user typically accesses an ODN instance maintained by someone else; the user is not running their own instance
the user may download data dumps and call APIs to get the data they are interested in
data dump changes are advertised as Atom feeds (see the feed-polling sketch after this list)
the user may access the data indirectly, for example via a 3rd-party data catalog, which - in order to show the user a preview or visualization of the data - has to first download that data (in a similar manner as if the user was accessing it directly, i.e. downloading a dump or accessing an API of an ODN instance maintained by someone else)
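As an illustration of consuming the Atom feeds mentioned above, here is a minimal sketch using the JDK's XML parser to list the <updated> timestamps of feed entries. The feed URL and layout are assumptions, since ODN only states that dump changes are advertised as Atom feeds.

```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch of checking an Atom feed for data dump updates.
// The feed URL is hypothetical.
public class DumpFeedChecker {
    public static void main(String[] args) throws Exception {
        String feedUrl = "https://odn.example.org/feeds/budget-2014.atom";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // required for namespace-based lookups
        Document feed = dbf.newDocumentBuilder()
                .parse(new URL(feedUrl).openStream());

        // Atom entries carry an <updated> timestamp; list them all.
        NodeList updated = feed.getElementsByTagNameNS(
                "http://www.w3.org/2005/Atom", "updated");
        for (int i = 0; i < updated.getLength(); i++) {
            System.out.println("entry updated at: " + updated.item(i).getTextContent());
        }
    }
}
```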
4.5.2. Aggregator of Open Data (public body, NGO, SME, etc.)
the same as in Section 4.5.1
aggregator may easily replicate content in another Open Data Node
aggregator may automate data integration and linking (a linking sketch follows this list)
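A minimal sketch, assuming Apache Jena, of the kind of linking an aggregator might automate: asserting owl:sameAs links between a local resource and a resource in another dataset. Both URIs are invented for illustration.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.vocabulary.OWL;

// Minimal sketch of automated linking: an owl:sameAs assertion
// between a local resource and an external one. URIs are hypothetical.
public class LinkingSketch {
    public static void main(String[] args) {
        Model links = ModelFactory.createDefaultModel();
        links.createResource("http://odn.example.org/resource/bratislava")
             .addProperty(OWL.sameAs,
                     links.createResource("http://dbpedia.org/resource/Bratislava"));
        links.write(System.out, "N-TRIPLES");
    }
}
```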
4.5.3. Application developer using Open Data (SME, NGO, etc., public body too)
the same as in Section 4.5.2
the application developer has tools for setting up automated generation of APIs, as well as custom APIs, on top of the published datasets (a sketch of such a generated endpoint follows)
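To give a feel for what a generated custom API endpoint might look like, here is a minimal sketch using the JDK's built-in HTTP server. The path, port, and JSON payload are hypothetical; ODN's actual API generation is configured in the ODN/Publication module.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch of a generated dataset API endpoint.
// Path, port, and payload are hypothetical.
public class GeneratedApiSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/api/budget-2014", exchange -> {
            byte[] body = "[{\"chapter\":\"education\",\"amount\":125000}]"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();  // serve until the process is stopped
    }
}
```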
4.5.4. Data administrator
possibility to set up the data extraction, transformation, and publication of open data
possibility to monitor execution of data extraction and transformation tasks
possibility to debug data extraction and transformation
4.6. Use Cases
In this deliverable, we depict use cases for data publishers and data consumers. The full list of use cases, together with related scenarios and mock-ups, can be found at https://team.eea.sk/wiki/display/COMSODE/Use+Cases (note: this is the internal consortium wiki space mentioned in Section 2.1 "Methodology"; it is subject to change based on subsequent user requirements and will most probably be moved to a public space once the community around ODN grows).
For each use case below, we introduce its name, a short description, and the ODN modules participating in the use case (ODN/M = ODN/Management module, ODN/UV = ODN/UnifiedViews, ODN/P = ODN/Publication module, ODN/IC = ODN/InternalCatalog module, ODN/C = ODN/PublicCatalog module). For full details, please see https://team.eea.sk/wiki/display/COMSODE/Use+Cases .
4.6.1. Use Cases for Data Publisher
ID | Name | Short Description | Module |
UC1 | Create dataset record | As a data publisher I want to create new record about the intended published data, so that I can define for every dataset information about the source data, intended transformations and ways how the transformed data should be published | ODN/M |
UC2 | Edit/Manage dataset record | As a data publisher I want to edit/manage dataset records | ODN/M |
UC3 | Delete dataset record | As a data publisher I want to delete outdated/obsolete dataset record | ODN/M |
UC4 | Configure transformation | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) | ODN/UV |
UC4a | Configure transformation using wizard | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) using a wizard, so that it is really simple to prepare typical dataset transformation | ODN/M |
UC5 | Configure publication | As a data publisher I want to configure how the transformed dataset is published, thus, how the dataset may be consumed by data consumers | ODN/P |
UC6 | Publish dataset | As a data publisher I want to publish the dataset | ODN/M |
UC7 | Transform dataset | As a data publisher I want to transform the dataset (the dataset is transformed but not published yet) | ODN/UV |
UC8 | Debug dataset transformation | As a data publisher I want to debug the dataset transformation, see intermediate results of the transformation, and see debug messages illustrating what happened during the dataset transformation | ODN/UV |
UC9 | Configure creation of RDF dumps | As a data publisher I want to configure creation of RDF dumps from my published datasets | ODN/P |
UC10 | Configure creation of CSV dumps | As a data publisher I want to configure creation of CSV dumps from my published datasets | ODN/P |
UC11 | Configure publishing data via REST API | As a data publisher I want to configure how the REST API is generated on top of my published data, which data is accessible via the REST API, which users may use the REST API, and which methods of accessing my data are available to data consumers | ODN/P |
UC12 | Configure publishing to SPARQL Endpoint | As a data publisher I want to configure how data consumers may connect to SPARQL endpoints with my published data | ODN/P |
UC13 | Schedule dataset publication | As a data publisher I want to automate the publication process, so that it can run every week or every time a new version of the dataset is available (see the scheduling sketch after this table) | ODN/M |
UC14 | Schedule dataset transformation | As a data publisher I want to automate the transformation part of the publication process, so that it can run every week or every time a new version of the dataset is available | ODN/M |
UC15 | Monitor data publishing tasks | As a data publisher I want to monitor the publishing task to see how the data transformation and data publishing were executed for my datasets | ODN/M |
UC16 | Basic overview about the transformation pipelines' execution | As a data publisher I want to monitor the publication of the dataset to see whether the publication was OK, or there were some errors | ODN/M |
UC17 | Detailed overview about the transformation pipelines' execution | As a data publisher I want to see the detailed overview about the transformations of the dataset | ODN/UV |
UC18 | Browse transformation logs/events | As a data publisher I want to browse logs and events to see in detail what happened during the dataset transformation | ODN/UV |
UC19 | Browse intermediate data | As a data publisher I want to browse the intermediate data produced as the dataset is being transformed | ODN/UV |
UC20 | Overview about the publication of the transformed data | As a data publisher I want to be informed about the publication of the transformed dataset, whether there were some problems or not | ODN/P |
UC21 | Schedule publishing of transformed RDF data | As a data publisher I want to automate the publishing of the transformed datasets; typical requested behaviour: whenever the dataset is transformed, it should be also published | ODN/P |
UC22 | Publish transformed dataset | As a data publisher I want to publish the dataset, which has already been transformed | ODN/P |
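As an illustration of the scheduling behaviour in UC13/UC14, here is a minimal sketch using plain JDK scheduling. ODN/Management's real scheduler is not specified here, and runPipeline() is a hypothetical placeholder for triggering a transformation or publication pipeline.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of weekly pipeline scheduling (UC13/UC14).
// runPipeline() is a hypothetical placeholder.
public class WeeklyScheduleSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                WeeklyScheduleSketch::runPipeline,
                0, 7, TimeUnit.DAYS);  // run immediately, then every 7 days
    }

    static void runPipeline() {
        System.out.println("transforming and publishing dataset ...");
    }
}
```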
4.6.2. Use Cases for Data Consumers
ID | Name | Short Description | Module |
UC101 | Consume RDF data dump | As a data consumer I want to download RDF data dump, so that I can load it to my data store and work with it | ODN/P |
UC102 | Consume CSV data dump | As a data consumer I want to download CSV data dump, so that I can load it to my data store and work with it | ODN/P |
UC103 | Consume version of the data dump valid at a certain time | As a data consumer I want to get the data dumps valid at a certain time in the past | ODN/P |
UC104 | Query SPARQL Endpoint | As an advanced data consumer I want to query RDF data directly using the SPARQL endpoint (see the query sketch after this table) | ODN/P |
UC105 | Use REST API | As a data consumer I want to use REST API, so that I can work with the data from my app | ODN/P |
UC106 | Browse Data Catalog | As a data consumer I want to browse and search the list of datasets (data catalog) | ODN/C |
UC107 | Get metadata about dataset | As a data consumer I want to get metadata of the published dataset | ODN/C |
UC108 | Browse (sample) data | As a data consumer I want to browse a data sample to get an idea of what is in the dataset | ODN/P |
UC109 | Data dump changes | As a data consumer I want to be notified (for example via RSS or Atom) when a data dump is updated or changed | ODN/P |
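As an illustration of UC104, here is a minimal sketch, assuming Apache Jena, of querying a published dataset via its SPARQL endpoint. The endpoint URL is hypothetical, and the query simply lists a few triples.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

// Minimal sketch of UC104: querying a SPARQL endpoint with Apache Jena.
// The endpoint URL is hypothetical.
public class SparqlConsumerSketch {
    public static void main(String[] args) {
        String endpoint = "https://odn.example.org/sparql";
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + " " + row.get("p") + " " + row.get("o"));
            }
        }
    }
}
```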
4.7. Other inputs for Architecture decisions
This is an additional list of inputs extending Deliverable 2.1: "User requirements for the publication platform from target organizations, including the map of typical environments".
The system must be extensible with new DPUs (Data Processing Units) on the data transformation (ETL) pipelines.
The overhead of managing data transformation (ETL) tasks and of managing outputs from/preparing inputs to DPUs must be reasonable: an execution of an ETL task must not take more than 200% of the time needed for manual execution of the DPUs' logic without the use of the ETL framework (under the condition that the pipeline is running alone). For example, if manual execution of the DPUs' logic takes 10 minutes, the ETL-managed execution must finish within 20 minutes.
The system must be able to process big files - CSV files containing millions of rows, RDF files containing hundreds of millions of triples (see the streaming sketch after this list).
Response time: The Web management GUI of ODN must respond in 99.9% of cases in less than 1 s (all components of ODN on one machine, the client connecting to the server over a line of at least 100 Mbit/s).
Target platform: Linux, Windows
Preferred languages for the system implementation: Java, others only in case of reuse of existing components with sufficient added-value (for ODN and for ODN users)
The internal format for all data being transformed during the ETL process is RDF, a universal machine-readable data format.
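To make the big-file and internal-format requirements concrete, here is a minimal sketch, assuming Apache Jena, of streaming a large CSV file line by line and emitting RDF triples without holding either the input or the output in memory. The file name, column layout, and URI scheme are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

// Minimal sketch of streaming a big CSV file into RDF triples.
// File name, column layout, and URI scheme are hypothetical.
public class CsvToRdfStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamRDF out = StreamRDFWriter.getWriterStream(System.out, Lang.NTRIPLES);
        out.start();
        try (BufferedReader in = new BufferedReader(new FileReader("big-dataset.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");  // assumed layout: id,label
                out.triple(Triple.create(
                        NodeFactory.createURI("http://odn.example.org/resource/" + cols[0]),
                        NodeFactory.createURI("http://www.w3.org/2000/01/rdf-schema#label"),
                        NodeFactory.createLiteral(cols[1])));
            }
        }
        out.finish();
    }
}
```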