
Open Data Node (ODN) is a publication platform that gives governments, municipalities and other parties (e.g. companies) an easy way to publish their data as machine readable (linked) open data.

ODN performs extraction and transformation (conversion, cleansing, anonymization, linking, etc.) of data provided by governments and municipalities. ODN stores the results of these transformations in its own storage and publishes them in open, machine readable formats to data consumers (citizens, companies, governments).
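To make the transformation step concrete, the following minimal sketch cleanses a hypothetical CSV file and converts it to RDF, in the spirit of ODN's extract-and-transform approach; the file name, column names and the example.org vocabulary are illustrative assumptions, not part of ODN.

```python
# A minimal cleanse-and-convert sketch: read a hypothetical contracts.csv,
# trim stray whitespace, and emit RDF triples with rdflib. All names and
# namespaces here are illustrative, not ODN-defined.
import csv

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical vocabulary

g = Graph()
with open("contracts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = EX["contract/" + row["id"].strip()]
        g.add((subject, RDF.type, EX.Contract))
        # cleansing: normalize whitespace before the data is published
        g.add((subject, EX.supplier, Literal(row["supplier"].strip())))

g.serialize(destination="contracts.ttl", format="turtle")
```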

 

4.1. Actors

The main actors dealing with the system are:

  • Data publishers
    governments, municipalities, and other parties providing data

  • Data consumers
    governments, municipalities, non-profit organizations (NGOs), citizens (the general public), companies (SMEs), and application developers consuming the transformed data

  • Data administrators
    administrators, analysts, data curators, etc., (partially) responsible for configuring ODN; usually employees of the data publisher

  • Administrators
    IT staff responsible for installation, maintenance, and (partially) configuration of ODN; usually employees of the data publisher

ODN helps publishers manage the complexity of source data and its transformation to open data, and deliver easy-to-use, high-quality open data to data consumers.

ODN helps data consumers get the data easily and efficiently in open, machine readable formats.

4.2. Inputs

The input to the system is the data of data publishers, stored in heterogeneous environments, in a wide variety of formats, and accessible through many different technologies.

4.3. Outputs

The output of the system is published open data in various forms, such as linked data or tabular data; API access to the data is also provided. A data consumer may be provided with:

  • RDF data as a result of export from the storage

  • RDF data as a result of SPARQL Query

  • CSV data as a result of export from the storage

  • REST API to access data in the storage

All forms of the data shall be available to data consumers under an open license. The formats of the published open data are described in more detail in the sections below.
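As an illustration of the SPARQL output form, the sketch below queries an ODN SPARQL endpoint over the standard SPARQL HTTP protocol; the endpoint URL is an assumed placeholder, not a fixed ODN path.

```python
# Query a hypothetical ODN SPARQL endpoint; the protocol (query parameter
# plus JSON results) is standard SPARQL 1.1, but the URL is made up.
import requests

ENDPOINT = "https://odn.example.org/sparql"  # hypothetical
QUERY = """
SELECT ?dataset ?title WHERE {
  ?dataset a <http://www.w3.org/ns/dcat#Dataset> ;
           <http://purl.org/dc/terms/title> ?title .
}
LIMIT 10
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["dataset"]["value"], "-", binding["title"]["value"])
```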

 

4.4. Features for data publishers

  • automated and repeatable data harvesting: extraction and transformation (conversion, cleansing, anonymization, etc.) of data, both:

    • initial harvesting of whole datasets (first import)

    • periodical harvesting of incremental updates

  • integration tools for extracting data from publisher’s internal systems (e.g. databases, data in files, information systems with API, etc.)

  • internal storage for data and metadata; the metadata format will be based on DCAT (http://www.w3.org/TR/vocab-dcat/) - see the metadata sketch after this list

  • data publishing in open and machine readable formats to the general public and businesses including automated efficient distribution of updated data and metadata (dataset replication)

  • integration with data catalogs (like CKAN) for automated publication and updating of dataset metadata

  • internal data catalog of datasets for maintenance of dataset metadata
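Because the metadata format will be based on DCAT, the following minimal sketch shows what a DCAT dataset record could look like when built with rdflib; the dataset URI, title, and keyword are invented for illustration.

```python
# Build a minimal DCAT dataset description with rdflib; all concrete
# values (URI, title, keyword) are illustrative, not ODN defaults.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("http://example.org/dataset/budget-2014")  # hypothetical

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Municipal budget 2014", lang="en")))
g.add((dataset, DCTERMS.description,
       Literal("Itemized budget of an example municipality.", lang="en")))
g.add((dataset, DCAT.keyword, Literal("budget")))

print(g.serialize(format="turtle"))
```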

4.5. Features for data consumers

Features for data consumers are discussed separately for different types of data consumers.

4.5.1. Citizen, data analyst, etc.

  • the user typically accesses an ODN instance maintained by someone else; the user is not running his or her own instance

  • the user may download data dumps and call APIs to get the data he or she is interested in

  • changes to data dumps are advertised as Atom feeds (see the sketch after this list)

  • the user may access the data indirectly, for example via a 3rd party data catalog, which - in order to show the user a preview or visualization of the data - has to first download that data (in a similar manner as if the user were accessing it directly: i.e. downloading a dump or accessing an API of an ODN instance maintained by someone else)
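A sketch of how a consumer might watch such an Atom feed, using the third-party feedparser library; the feed URL is an assumption about a concrete ODN instance, not a documented ODN route.

```python
# Poll an Atom feed that advertises data dump changes. The feed URL is
# hypothetical; feedparser is one of several libraries that would work.
import feedparser

FEED_URL = "https://odn.example.org/feeds/dataset-updates.atom"  # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # each entry announces an updated or newly published dump
    print(entry.updated, entry.title, entry.link)
```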

4.5.2. Aggregator of Open Data (public body, NGO, SME, etc.)

  • the same as in Section 4.5.1

  • aggregator may easily replicate content in another Open Data Node

  • aggregator may automate data integration and linking

4.5.3. Application developer using Open Data (SME, NGO, etc., public body too)

  • the same as in Section 4.5.2

  • the application developer has tools for setting up automated generation of APIs, as well as custom APIs, on top of the published datasets

4.5.4. Data administrator

  • possibility to set up the data extraction, transformation, and publication of open data

  • possibility to monitor the execution of data extraction and transformation tasks

  • possibility to debug data extraction and transformation

4.6. Use Cases

In this deliverable, we depict use cases for data publishers and data consumers. The full list of use cases, together with related scenarios and mock-ups, can be found at https://team.eea.sk/wiki/display/COMSODE/Use+Cases (note: this is the internal consortium wiki space mentioned in Section 2.1 'Methodology'; it is subject to change based on subsequent user requirements and will most probably be moved to a public space once the community around ODN grows).

For each use case below, we introduce its name, a short description, and the ODN modules participating in the use case (ODN/M = ODN/Management module, ODN/UV = ODN/UnifiedViews, ODN/P = ODN/Publication module, ODN/IC = ODN/InternalCatalog module, ODN/C = ODN/Catalog module). For full details, please see https://team.eea.sk/wiki/display/COMSODE/Use+Cases.

4.6.1. Use Cases for Data Publisher

[Figure: Use Cases - Core Dataset Management]

 

ID | Name | Short Description | Module(s)

UC1 | Create dataset record | As a data publisher I want to create a new record about the intended published data, so that I can define for every dataset information about the source data, the intended transformations, and the way the transformed data should be published | ODN/M, ODN/IC
UC2 | Edit/Manage dataset record | As a data publisher I want to edit/manage dataset records | ODN/M, ODN/IC
UC3 | Delete dataset record | As a data publisher I want to delete an outdated/obsolete dataset record | ODN/M, ODN/IC
UC4 | Configure transformation | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) | ODN/UV
UC4a | Configure transformation using wizard | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) using a wizard, so that preparing a typical dataset transformation is really simple | ODN/M
UC5 | Configure publication | As a data publisher I want to configure how the transformed dataset is published, i.e. how the dataset may be consumed by data consumers | ODN/P
UC6 | Publish dataset | As a data publisher I want to publish the dataset | ODN/M, ODN/IC, ODN/UV, ODN/P, ODN/C
UC7 | Transform dataset | As a data publisher I want to transform the dataset (the dataset is transformed but not published yet) | ODN/UV
UC8 | Debug dataset transformation | As a data publisher I want to debug the dataset transformation, see intermediate results of the transformation, and see debug messages illustrating what happened during the dataset transformation | ODN/UV
UC9 | Configure creation of RDF dumps | As a data publisher I want to configure the creation of RDF dumps from my published datasets | ODN/P
UC10 | Configure creation of CSV dumps | As a data publisher I want to configure the creation of CSV dumps from my published datasets | ODN/P
UC11 | Configure publishing data via REST API | As a data publisher I want to configure how the REST API is generated on top of my published data: which data is accessible via the REST API, which users may use the REST API, and which methods of accessing my data are available to data consumers | ODN/P
UC12 | Configure publishing to SPARQL endpoint | As a data publisher I want to configure how data consumers may connect to SPARQL endpoints with my published data | ODN/P
UC13 | Schedule dataset publication | As a data publisher I want to automate the publication process, so that it can run every week or every time a new version of the dataset is available (see the scheduling sketch after this table) | ODN/M, ODN/UV, ODN/P
UC14 | Schedule dataset transformation | As a data publisher I want to automate the transformation part of the publication process, so that it can run every week or every time a new version of the dataset is available | ODN/M, ODN/UV
UC15 | Monitor data publishing tasks | As a data publisher I want to monitor the publishing tasks to see how data transformation and data publishing were executed for my datasets | ODN/M
UC16 | Basic overview of the transformation pipelines' execution | As a data publisher I want to monitor the publication of the dataset to see whether the publication was OK or there were errors | ODN/M, ODN/UV, ODN/P
UC17 | Detailed overview of the transformation pipelines' execution | As a data publisher I want to see a detailed overview of the transformations of the dataset | ODN/UV
UC18 | Browse transformation logs/events | As a data publisher I want to browse logs and events to see in detail what happened during the dataset transformation | ODN/UV
UC19 | Browse intermediate data | As a data publisher I want to browse the intermediate data produced as the dataset is being transformed | ODN/UV
UC20 | Overview of the publication of the transformed data | As a data publisher I want to be informed about the publication of the transformed dataset, whether there were problems or not | ODN/P
UC21 | Schedule publishing of transformed RDF data | As a data publisher I want to automate the publishing of transformed datasets; typical requested behaviour: whenever the dataset is transformed, it should also be published | ODN/P
UC22 | Publish transformed dataset | As a data publisher I want to publish a dataset which has already been transformed | ODN/P
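The scheduling use cases (UC13, UC14, UC21) boil down to running the transform-and-publish job on a fixed cadence or whenever new data arrives. A bare-bones standard-library sketch of a weekly cadence follows; publish_dataset() is a hypothetical placeholder, not an ODN API, and a real deployment would rely on cron or ODN's built-in scheduler rather than a hand-rolled loop.

```python
# Minimal weekly scheduling loop illustrating UC13/UC14. publish_dataset()
# is a hypothetical placeholder for the actual transform-and-publish job.
import time
from datetime import datetime, timedelta

def publish_dataset():
    print("running transform-and-publish job at", datetime.now().isoformat())

INTERVAL = timedelta(weeks=1)
next_run = datetime.now()
while True:
    if datetime.now() >= next_run:
        publish_dataset()
        next_run = datetime.now() + INTERVAL
    time.sleep(60)  # re-check once a minute
```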

 

 

4.6.2. Use Cases for Data Consumers

[Figure: Use Cases - Data Consumer Use Cases]



 

ID | Name | Short Description | Module(s)

UC101 | Consume RDF data dump | As a data consumer I want to download an RDF data dump, so that I can load it into my data store and work with it | ODN/P
UC102 | Consume CSV data dump | As a data consumer I want to download a CSV data dump, so that I can load it into my data store and work with it (see the consumption sketch after this table) | ODN/P
UC103 | Consume version of the data dump valid at a certain time | As a data consumer I want to get the data dumps valid at a certain time in the past | ODN/P
UC104 | Query SPARQL endpoint | As an advanced data consumer I want to query RDF data directly using a SPARQL endpoint | ODN/P
UC105 | Use REST API | As a data consumer I want to use the REST API, so that I can work with the data from my app | ODN/P
UC106 | Browse data catalog | As a data consumer I want to browse and search the list of datasets (data catalog) | ODN/C
UC107 | Get metadata about dataset | As a data consumer I want to get the metadata of a published dataset | ODN/C
UC108 | Browse (sample) data | As a data consumer I want to browse a data sample to get an idea of what is in the dataset | ODN/P
UC109 | Data dump changes | As a data consumer I want to be notified (for example via RSS or Atom) when a data dump is updated or changed | ODN/P
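To illustrate UC102 and UC105 together, the sketch below downloads a CSV dump and then calls a REST endpoint; every URL path is an assumption about a concrete ODN instance, not a documented ODN route.

```python
# Consume a CSV dump (UC102) and a REST API (UC105). All URLs below are
# hypothetical examples, not documented ODN routes.
import csv
import io

import requests

BASE = "https://odn.example.org"  # hypothetical ODN instance

# UC102: download a CSV dump and parse its rows
dump = requests.get(f"{BASE}/dumps/budget-2014.csv")
dump.raise_for_status()
rows = list(csv.DictReader(io.StringIO(dump.text)))
print(len(rows), "rows in the dump")

# UC105: fetch records through a REST API instead of a full dump
resp = requests.get(f"{BASE}/api/datasets/budget-2014/records",
                    params={"limit": 100})
resp.raise_for_status()
records = resp.json()
```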

    

 

4.7. Other inputs for Architecture decisions

This is an additional list of inputs extending Deliverable 2.1: ‘User requirements for the publication platform from target organizations, including the map of typical environments’.

  • The system must be extensible with new DPUs (data processing units) on the data transformation (ETL) pipelines.

  • The overhead of managing data transformation (ETL) tasks and managing outputs from/preparing inputs to DPUs must be reasonable: execution of an ETL task must not take more than 200% of the time needed for manual execution of the DPUs' logic without the ETL framework (under the condition that the pipeline is running alone). For example, if running the DPUs' logic manually takes 10 minutes, the same task executed through the ETL framework must finish within 20 minutes.

  • The system must be able to process big files: CSV files containing millions of rows, RDF files containing hundreds of millions of triples (see the streaming sketch after this list).

  • Response time: the Web management GUI of ODN must respond in 99.9% of cases in less than 1 second (with all components of ODN on one machine and the client connecting to the server over a line of at least 100 Mb/s).

  • Target platform: Linux, Windows

  • Preferred languages for the system implementation: Java, others only in case of reuse of existing components with sufficient added-value (for ODN and for ODN users)

  • The internal format for all data being transformed during the ETL process is RDF, a universal machine readable data format.
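For the big-file requirement above, processing has to stream rather than load entire files into memory. A minimal sketch with the standard csv module follows; the file name is illustrative.

```python
# Stream a CSV file with millions of rows: the generator yields one row
# at a time, so memory use stays flat regardless of file size.
import csv

def stream_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row

# example: count rows without materializing the whole file
total = sum(1 for _ in stream_rows("big-input.csv"))
print(total, "rows processed")
```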