Implementation of project OpenData

This is only a copy

This is a working copy of the English version of the text. The official version of this document for the public can be found on the OpenData.sk portal: http://opendata.sk/liferay/projects

The Slovak version of this wiki page is here.

Implementation

First step - definition of formats for sharing and selection of appropriate OSS technologies

The proposed solution includes an open source package and a deployment methodology that give an organization (ministry, municipality, non-profit organization, corporation) a tool for publishing data from its internal systems. In the first phase, public information (for example publicly accessible contracts, invoices, accounting data, public notices etc.) is published using open standards.

Every organization can choose what kind of data it will publish and how often. We believe that open communication between organizations and the public will create a sensible compromise between secrecy and openness.

Publishing of digital documents is most advanced in the area of cataloguing and archiving of literature and cultural works. If we replace the idea of a scan of a painting from a gallery with the idea of a scan of an invoice from a municipality, we can easily apply the results of research and development from the area of culture to the topic of Open Data. That is why we borrow an already defined communication standard - the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This protocol is suitable for read-only publishing of information.
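As a rough illustration of what read-only harvesting over OAI-PMH looks like, the sketch below builds a ListRecords request URL. The verb and the parameter names (verb, metadataPrefix, from) come from the OAI-PMH specification; the base URL and the function name list_records_url are hypothetical placeholders and do not refer to any real Open Data endpoint.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real repository publishes its own base URL.
BASE_URL = "http://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (a plain HTTP GET)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL, from_date="2011-01-01"))
```

Because every request is a simple GET with a fixed set of verbs, a harvesting client cannot modify the repository, which is what makes the protocol a natural fit for publishing.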

The solution also provides access to the data for third-party applications, so that collected and published Open Data can be reused more often and in a better way. Here too, it relies on proven, existing concepts from other spheres (Web 2.0, social networks, ...).

Some data imported into the solution use specialized, scarcely available data formats. For these, an open interface is likewise defined for the development of tools that transform such unusual data into a more accessible form.

In this way, it is possible to assemble the implementation of the OpenData solution from existing and proven open source technologies and standards and to run it in any Slovak organization, even a large one (ministries, large companies).

The solution also includes its security assessment and the development of recommendations on how to run it in various kinds of organizations. A reference implementation will then be deployed for selected subjects and subsequently tested and certified.

We expect that, for the needs of the OpenData implementation, we will be able to use the potential of an existing Slovak open source package - Custodea, which implements digital publishing in the culture sector.

Solution architecture

How it works

An installed package of the (open source) application - Open Data Node - collects documents and metadata about them, then processes them and provides them to the public, to institutions and to applications.

A document can enter an Open Data Node in a few different ways:
- A user submits a document and fills in metadata for it using a web interface.
- Open Data Node, using a Harvester, monitors defined sources and automatically collects and processes (OCR, metadata extraction, conversion, linking with registries, etc.) published documents.
- Authorized users verify, comment on and issue recommendations for correction or enhancement of the collected documents. In this way, they create content of higher quality, with annotations and additional connections (references).
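The flow described above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions: the Document class, the step functions and the sample registry are all hypothetical names invented here, the OCR and conversion steps are omitted, and the "metadata extraction" is a trivial stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str                                   # where the document was found
    raw_text: str = ""                            # text after OCR or conversion
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def extract_metadata(doc):
    """Placeholder step: derive trivial metadata from the text."""
    doc.metadata["length"] = len(doc.raw_text)
    return doc


def link_with_registry(doc, registry):
    """Placeholder step: note registry keys that occur in the text."""
    doc.metadata["linked"] = sorted(k for k in registry if k in doc.raw_text)
    return doc


def harvest(doc, registry):
    """Run the automatic steps in order (OCR and conversion omitted)."""
    return link_with_registry(extract_metadata(doc), registry)


def annotate(doc, user, comment):
    """Record a comment from an authorized user."""
    doc.annotations.append((user, comment))
    return doc


# Hypothetical sample data, for illustration only.
registry = {"ACME s.r.o.": "registry entry", "Other a.s.": "registry entry"}
doc = Document(source="http://example.org/invoice-1",
               raw_text="Invoice for ACME s.r.o.")
doc = harvest(doc, registry)
doc = annotate(doc, "reviewer", "Amount is missing")
print(doc.metadata, doc.annotations)
```

Keeping each processing step as a separate function mirrors the idea that automatic processing and human review are independent stages that enrich the same document.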

The sources of documents (and data in general) for this solution are any data published by the information systems and on the web pages of institutions - public administration, civil services, the third sector, but also selected data from the private sector. The majority of currently published invoices from public administration and from organizations co-financed by the government are published either as scanned paper documents - those are candidates for data extraction using OCR - or as accounting reports, which the harvester is able to process directly.

Important sources of data are public registries, such as the Business Register, registers of the Statistical Office etc., which will ensure the integrity of references in collected documents and their metadata.

Access to the collected data is provided on several levels:
- publishing for further harvesting (OAI-PMH, FTP, HTTP, ...) by other systems (for example other instances of Open Data Node)
- services provided over standardized interfaces such as web services, REST etc. in various formats (XML, JSON, ...) to maximize the potential for reuse of the collected Open Data by third-party applications
- value-added services provided over a standard web interface (browsing)
- mass export (of a selected subset) of metadata in a selected format (RDF/XML, SKOS, Dublin Core)
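To make the harvesting and format levels more concrete, here is a small sketch of how a third-party application might turn an OAI-PMH response carrying Dublin Core metadata into JSON. The XML namespaces are the standard OAI-PMH and Dublin Core ones; the sample record, its identifier and the function name records_to_dicts are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body with a single invented record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:invoice-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Invoice 1</dc:title>
          <dc:date>2011-03-01</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_to_dicts(xml_text):
    """Extract identifier, title and date from each record in the response."""
    root = ET.fromstring(xml_text)
    result = []
    for record in root.iterfind(".//oai:record", NS):
        result.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "date": record.findtext(".//dc:date", namespaces=NS),
        })
    return result


print(json.dumps(records_to_dicts(SAMPLE), indent=2))
```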

Presentation of collected data: Open Data Node itself will present the data it stores (or registers) in the following ways:
- as a simple list of "links" which match a given filter (Google-like)
- faceted browsing / filtering (using various categories - source of the document, type of the document, date etc.)
- display on a map (for documents which are linked to a geographical location, like a construction permit, company headquarters etc.)
- display on a timeline (for documents which include information about date and time, such as invoice due date, contract signing date etc.)
- table with selected columns, including the ability to export it into CSV (Excel or similar) or XML format
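Faceted filtering and the table export from the list above can be illustrated with a few lines of code. This is a sketch only: the sample documents, the field names (source, type, date) and the function names are assumptions made for the example, not part of Open Data Node.

```python
import csv
import io
from collections import Counter

# Invented sample metadata records.
DOCS = [
    {"source": "Ministry A", "type": "invoice", "date": "2011-01-10"},
    {"source": "Ministry A", "type": "contract", "date": "2011-02-05"},
    {"source": "Town B", "type": "invoice", "date": "2011-02-20"},
]


def facet_counts(docs, facet):
    """Count how many documents fall under each value of one facet."""
    return Counter(d[facet] for d in docs)


def filter_docs(docs, **selected):
    """Keep documents that match every selected facet value."""
    return [d for d in docs if all(d[k] == v for k, v in selected.items())]


def to_csv(docs, columns):
    """Export the chosen columns as CSV text, ignoring other fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(docs)
    return buffer.getvalue()


print(facet_counts(DOCS, "type"))
invoices = filter_docs(DOCS, type="invoice")
print(to_csv(invoices, ["source", "date"]))
```

The facet counts are what a browsing interface would show next to each category, and the same filtered list feeds both the table view and the export.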

Portal: a list of institutions participating in the initiative needs to be maintained. At the start, a simple solution such as a wiki or blog will be used, but later a formalized directory will need to be created.

Open architecture: the individual components of the architecture, as well as individual instances of the system (Open Data Node), can be freely integrated into hierarchies (cascades, ...). This allows maximum flexibility in accommodating various types and purposes of collected data (segmented or sector systems, for example systems for health care data, education, regional data etc.) and also various degrees of (pre-)processing of input data - OCR, extraction, format, size, validation, limits.

Illustration of multiple Open Data Nodes connected into a hierarchy
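The cascading of nodes can also be sketched in code. The example below is a simplified model, assuming each node holds its own records and harvests recursively from its children; the class OpenDataNode, its methods and the node names are hypothetical and only illustrate the idea of a hierarchy.

```python
class OpenDataNode:
    """Simplified model of one node in a harvesting hierarchy."""

    def __init__(self, name, records=None):
        self.name = name
        self.records = list(records or [])
        self.children = []

    def add_child(self, node):
        self.children.append(node)
        return node

    def harvest(self):
        """Return own records plus everything harvested from child nodes."""
        collected = [(self.name, r) for r in self.records]
        for child in self.children:
            collected.extend(child.harvest())
        return collected


# Invented sector and regional nodes.
national = OpenDataNode("national")
health = national.add_child(OpenDataNode("health", ["hospital budget"]))
health.add_child(OpenDataNode("region-west", ["clinic invoice"]))
national.add_child(OpenDataNode("education", ["school contract"]))
print(national.harvest())
```

In a real deployment each arrow in the hierarchy would be a harvesting connection (for example over OAI-PMH) rather than an in-memory method call.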
