Open Data Node (ODN) is a publication platform which provides governments, municipalities, and other entities (e.g. companies) with an easy way of publishing their data as machine-readable (linked) open data.
ODN performs extraction and transformation (conversion, cleansing, anonymization, linking, etc.) of data provided by governments and municipalities. It stores the results of these transformations in its own storage and publishes them in open, machine-readable formats to data consumers (citizens, companies, governments).
4.1. Actors
The main actors interacting with the system are:
Data publishers: governments, municipalities, and other entities providing data
Data consumers: governments, municipalities, non-profit organizations (NGOs), citizens (general public), companies (SMEs), and application developers consuming the transformed data
Data administrators: administrators, analysts, data curators, etc., (partially) responsible for configuring ODN - usually employees of the data publisher
Administrators: IT staff responsible for installation, maintenance, and (partially) configuration of ODN - usually employees of the data publisher
ODN helps data publishers manage the complexity of source data and its transformation to open data, and deliver easy-to-use, high-quality open data to data consumers.
ODN helps data consumers get the data easily and efficiently in open, machine-readable formats.
4.2. Inputs
The input to the system is the data of data publishers, stored in heterogeneous environments, using a wide variety of formats, and requiring many different technologies to access and process.
4.3. Outputs
The output of the system is published open data in various forms, for example as linked data or as tabular data; API access to the data is also included. A data consumer may be provided with:
RDF data as a result of an export from the storage
RDF data as a result of a SPARQL query
CSV data as a result of an export from the storage
a REST API to access data in the storage
All forms of the data shall be available to data consumers under an open license.
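As an illustration of these consumer-facing outputs, here is a minimal sketch of fetching a CSV dump over HTTP with the JDK's built-in client. The endpoint URL and dataset name are hypothetical; the actual URL scheme is defined by the ODN configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a data consumer fetching a CSV dump over the REST API.
// The endpoint URL and dataset name are hypothetical.
public class OdnDumpClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://odn.example.org/api/datasets/budget-2014/dump"))
                .header("Accept", "text/csv")   // ask for the CSV form of the dataset
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```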
4.4. Features for data publishers
automated and repeatable data harvesting: extraction and transformation (conversion, cleansing, anonymization, etc.) of data, both:
initial harvesting of whole datasets (first import)
periodical harvesting of incremental updates
integration tools for extracting data from publisher’s internal systems (e.g. databases, data in files, information systems with API, etc.)
internal storage for data and metadata; the metadata format will be based on DCAT (http://www.w3.org/TR/vocab-dcat/) - a metadata sketch follows this list
data publishing in open and machine readable formats to the general public and businesses including automated efficient distribution of updated data and metadata (dataset replication)
integration with data catalogs (like CKAN) for automated publication and updating of dataset metadata
internal data catalog of datasets for maintenance of dataset metadata
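To illustrate the DCAT-based metadata mentioned above, the following is a minimal sketch, assuming Apache Jena as the RDF library, of building a dataset record. The dataset URI and property values are invented for illustration and do not prescribe ODN's actual metadata schema, which is only said to be "based on DCAT".

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

// Minimal sketch of a DCAT-style dataset record built with Apache Jena.
// All URIs and values are hypothetical.
public class DcatRecordSketch {
    static final String DCAT_NS = "http://www.w3.org/ns/dcat#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property keyword = model.createProperty(DCAT_NS, "keyword");

        model.createResource("http://odn.example.org/dataset/budget-2014")
                .addProperty(RDF.type, model.createResource(DCAT_NS + "Dataset"))
                .addProperty(DCTerms.title, "Municipal budget 2014")
                .addProperty(DCTerms.publisher,
                        model.createResource("http://odn.example.org/org/municipality"))
                .addProperty(keyword, "budget")
                .addProperty(keyword, "finance");

        model.write(System.out, "TURTLE");  // serialize the record for inspection
    }
}
```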
4.5. Features for data consumers
Features for data consumers are discussed separately for different types of data consumers.
4.5.1. Citizen, data analyst, etc.
the user typically accesses an ODN instance maintained by someone else; the user is not running their own instance
the user may download data dumps and call APIs to get the data they are interested in
data dump changes are advertised as Atom feeds (see the feed-polling sketch after this list)
the user may access the data indirectly, for example via a 3rd-party data catalog, which - in order to show the user a preview or visualization of the data - has to first download that data (in a similar manner as if the user was accessing it directly, i.e. downloading a dump or accessing an API of an ODN instance maintained by someone else)
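As an illustration of consuming the Atom feeds mentioned above, here is a minimal sketch using the JDK's XML parser to list the <updated> timestamps of feed entries. The feed URL and layout are assumptions, since ODN only states that dump changes are advertised as Atom feeds.

```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch of checking an Atom feed for data dump updates.
// The feed URL is hypothetical.
public class DumpFeedChecker {
    public static void main(String[] args) throws Exception {
        String feedUrl = "https://odn.example.org/feeds/budget-2014.atom";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // required for namespace-based lookups
        Document feed = dbf.newDocumentBuilder()
                .parse(new URL(feedUrl).openStream());

        // Atom entries carry an <updated> timestamp; list them all.
        NodeList updated = feed.getElementsByTagNameNS(
                "http://www.w3.org/2005/Atom", "updated");
        for (int i = 0; i < updated.getLength(); i++) {
            System.out.println("entry updated at: " + updated.item(i).getTextContent());
        }
    }
}
```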
4.5.2. Aggregator of Open Data (public body, NGO, SME, etc.)
the same as in Section 4.5.1
aggregator may easily replicate content in another Open Data Node
aggregator may automate data integration and linking (a linking sketch follows this list)
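A minimal sketch, assuming Apache Jena, of the kind of linking an aggregator might automate: asserting owl:sameAs links between a local resource and a resource in another dataset. Both URIs are invented for illustration.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.vocabulary.OWL;

// Minimal sketch of automated linking: an owl:sameAs assertion
// between a local resource and an external one. URIs are hypothetical.
public class LinkingSketch {
    public static void main(String[] args) {
        Model links = ModelFactory.createDefaultModel();
        links.createResource("http://odn.example.org/resource/bratislava")
             .addProperty(OWL.sameAs,
                     links.createResource("http://dbpedia.org/resource/Bratislava"));
        links.write(System.out, "N-TRIPLES");
    }
}
```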
4.5.3. Application developer using Open Data (SME, NGO, etc., public body too)
the same as in Section 4.5.2
the application developer has tools for setting up automated generation of APIs, as well as custom APIs, on top of the published datasets (a sketch of such a generated endpoint follows)
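To give a feel for what a generated custom API endpoint might look like, here is a minimal sketch using the JDK's built-in HTTP server. The path, port, and JSON payload are hypothetical; ODN's actual API generation is configured in the ODN/Publication module.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch of a generated dataset API endpoint.
// Path, port, and payload are hypothetical.
public class GeneratedApiSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/api/budget-2014", exchange -> {
            byte[] body = "[{\"chapter\":\"education\",\"amount\":125000}]"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();  // serve until the process is stopped
    }
}
```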
4.5.4. Data administrator
possibility to set up the data extraction, transformation, and publication of open data
possibility to monitor execution of data extraction and transformation tasks
possibility to debug data extraction and transformation
4.6. Use Cases
In this deliverable, we depict use cases for data publishers and data consumers. The full list of use cases, together with related scenarios and mock-ups, can be found at https://team.eea.sk/wiki/display/COMSODE/Use+Cases (note: this is the internal consortium wiki space mentioned in Section 2.1 "Methodology"; it is subject to change based on subsequent user requirements and will most probably be moved to a public space once the community around ODN grows).
For each use case below, we introduce its name, a short description, and the ODN modules participating in the use case (ODN/M = ODN/Management module, ODN/UV = ODN/UnifiedViews, ODN/P = ODN/Publication module, ODN/IC = ODN/InternalCatalog module, ODN/C = ODN/PublicCatalog module). For full details, please see https://team.eea.sk/wiki/display/COMSODE/Use+Cases .
4.6.1. Use Cases for Data Publisher
ID | Name | Short Description | Module |
UC1 | Create dataset record | As a data publisher I want to create new record about the intended published data, so that I can define for every dataset information about the source data, intended transformations and ways how the transformed data should be published | ODN/M |
UC2 | Edit/Manage dataset record | As a data publisher I want to edit/manage dataset records | ODN/M |
UC3 | Delete dataset record | As a data publisher I want to delete outdated/obsolete dataset record | ODN/M |
UC4 | Configure transformation | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) | ODN/UV |
UC4a | Configure transformation using wizard | As a data publisher I want to configure dataset transformation (cleansing, linking, enrichment, quality assessment, etc.) using a wizard, so that it is really simple to prepare typical dataset transformation | ODN/M |
UC5 | Configure publication | As a data publisher I want to configure how the transformed dataset is published, thus, how the dataset may be consumed by data consumers | ODN/P |
UC6 | Publish dataset | As a data publisher I want to publish the dataset | ODN/M |
UC7 | Transform dataset | As a data publisher I want to transform the dataset (the dataset is transformed but not published yet) | ODN/UV |
UC8 | Debug dataset transformation | As a data publisher I want to debug the dataset transformation, see intermediate results of the transformation, and see debug messages illustrating what happened during the dataset transformation | ODN/UV |
UC9 | Configure creation of RDF dumps | As a data publisher I want to configure creation of RDF dumps from my published datasets | ODN/P |
UC10 | Configure creation of CSV dumps | As a data publisher I want to configure creation of CSV dumps from my published datasets | ODN/P |
UC11 | Configure publishing data via REST API | As a data publisher I want to configure how the REST API is generated on top of my published data, which data is accessible via the REST API, which users may use the REST API, and which methods of accessing my data are available to data consumers | ODN/P |
UC12 | Configure publishing to SPARQL Endpoint | As a data publisher I want to configure how data consumers may connect to SPARQL endpoints with my published data | ODN/P |
UC13 | Schedule dataset publication | As a data publisher I want to automate the publication process, so that it can run every week or every time a new version of the dataset is available (see the scheduling sketch after this table) | ODN/M |
UC14 | Schedule dataset transformation | As a data publisher I want to automate the transformation part of the publication process, so that it can run every week or every time a new version of the dataset is available | ODN/M |
UC15 | Monitor data publishing tasks | As a data publisher I want to monitor the publishing task to see how the data transformation and data publishing were executed for my datasets | ODN/M |
UC16 | Basic overview about the transformation pipelines' execution | As a data publisher I want to monitor the publication of the dataset to see whether the publication was OK, or there were some errors | ODN/M |
UC17 | Detailed overview about the transformation pipelines' execution | As a data publisher I want to see the detailed overview about the transformations of the dataset | ODN/UV |
UC18 | Browse transformation logs/events | As a data publisher I want to browse logs and events to see in detail what happened during the dataset transformation | ODN/UV |
UC19 | Browse intermediate data | As a data publisher I want to browse the intermediate data produced as the dataset is being transformed | ODN/UV |
UC20 | Overview about the publication of the transformed data | As a data publisher I want to be informed about the publication of the transformed dataset, whether there were some problems or not | ODN/P |
UC21 | Schedule publishing of transformed RDF data | As a data publisher I want to automate the publishing of the transformed datasets; typical requested behaviour: whenever the dataset is transformed, it should be also published | ODN/P |
UC22 | Publish transformed dataset | As a data publisher I want to publish the dataset, which has already been transformed | ODN/P |
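As an illustration of the scheduling behaviour in UC13/UC14, here is a minimal sketch using plain JDK scheduling. ODN/Management's real scheduler is not specified here, and runPipeline() is a hypothetical placeholder for triggering a transformation or publication pipeline.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of weekly pipeline scheduling (UC13/UC14).
// runPipeline() is a hypothetical placeholder.
public class WeeklyScheduleSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                WeeklyScheduleSketch::runPipeline,
                0, 7, TimeUnit.DAYS);  // run immediately, then every 7 days
    }

    static void runPipeline() {
        System.out.println("transforming and publishing dataset ...");
    }
}
```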
4.6.2. Use Cases for Data Consumers
ID | Name | Short Description | Module |
UC101 | Consume RDF data dump | As a data consumer I want to download RDF data dump, so that I can load it to my data store and work with it | ODN/P |
UC102 | Consume CSV data dump | As a data consumer I want to download CSV data dump, so that I can load it to my data store and work with it | ODN/P |
UC103 | Consume version of the data dump valid at a certain time | As a data consumer I want to get the data dumps valid at a certain time in the past | ODN/P |
UC104 | Query SPARQL Endpoint | As an advanced data consumer I want to query RDF data directly using the SPARQL endpoint (see the query sketch after this table) | ODN/P |
UC105 | Use REST API | As a data consumer I want to use REST API, so that I can work with the data from my app | ODN/P |
UC106 | Browse Data Catalog | As a data consumer I want to browse and search the list of datasets (data catalog) | ODN/C |
UC107 | Get metadata about dataset | As a data consumer I want to get metadata of the published dataset | ODN/C |
UC108 | Browse (sample) data | As a data consumer I want to browse a data sample to get an idea of what is in the dataset | ODN/P |
UC109 | Data dump changes | As a data consumer I want to be notified (for example via RSS or Atom) when a data dump is updated or changed | ODN/P |
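As an illustration of UC104, here is a minimal sketch, assuming Apache Jena, of querying a published dataset via its SPARQL endpoint. The endpoint URL is hypothetical, and the query simply lists a few triples.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

// Minimal sketch of UC104: querying a SPARQL endpoint with Apache Jena.
// The endpoint URL is hypothetical.
public class SparqlConsumerSketch {
    public static void main(String[] args) {
        String endpoint = "https://odn.example.org/sparql";
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + " " + row.get("p") + " " + row.get("o"));
            }
        }
    }
}
```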
4.7. Other inputs for Architecture decisions
This is an additional list of inputs extending Deliverable 2.1: "User requirements for the publication platform from target organizations, including the map of typical environments".
The system must be extensible with new DPUs (Data Processing Units) on the data transformation (ETL) pipelines.
The overhead of managing data transformation (ETL) tasks and of managing outputs from/preparing inputs to DPUs must be reasonable: an execution of an ETL task must not take more than 200% of the time needed for manual execution of the DPUs' logic without the use of the ETL framework (under the condition that the pipeline is running alone). For example, if manual execution of the DPUs' logic takes 10 minutes, the ETL-managed execution must finish within 20 minutes.
The system must be able to process big files - CSV files containing millions of rows, RDF files containing hundreds of millions of triples (see the streaming sketch after this list).
Response time: The Web management GUI of ODN must respond in 99.9% of cases in less than 1 s (all components of ODN on one machine, the client connecting to the server over a line of at least 100 Mbit/s).
Target platform: Linux, Windows
Preferred languages for the system implementation: Java, others only in case of reuse of existing components with sufficient added-value (for ODN and for ODN users)
The internal format for all data being transformed during the ETL process is RDF, a universal machine-readable data format.
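To make the big-file and internal-format requirements concrete, here is a minimal sketch, assuming Apache Jena, of streaming a large CSV file line by line and emitting RDF triples without holding either the input or the output in memory. The file name, column layout, and URI scheme are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

// Minimal sketch of streaming a big CSV file into RDF triples.
// File name, column layout, and URI scheme are hypothetical.
public class CsvToRdfStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamRDF out = StreamRDFWriter.getWriterStream(System.out, Lang.NTRIPLES);
        out.start();
        try (BufferedReader in = new BufferedReader(new FileReader("big-dataset.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");  // assumed layout: id,label
                out.triple(Triple.create(
                        NodeFactory.createURI("http://odn.example.org/resource/" + cols[0]),
                        NodeFactory.createURI("http://www.w3.org/2000/01/rdf-schema#label"),
                        NodeFactory.createLiteral(cols[1])));
            }
        }
        out.finish();
    }
}
```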