Big Picture
To clarify the issues involved and the structure of the problem, we outline the overall architecture of the solution.
Component Model
Component name | Purpose | Note |
---|---|---|
Data Harvester | Automatic and semi-automatic data updates | Various protocols and techniques: OAI-PMH, web download, offline files, user-upload |
Open Data Enhancer | Processing of input data: OCR, automatic metadata extraction, classification, contextualisation and enrichment | |
Open Data Repository | Centralized data storage, metadata and indices for open data | |
Open Data Services & API | Data services for third-party applications | |
Portal & Applications | Portal as an organizational and technical tool for the community | Basic functionality for searching and presenting open documents |
Data Harvester
The main task of this component is to feed the system with raw data (documents) on a regular basis, adding new documents and updating existing ones.
- collects data from defined sources
- supports protocols such as OAI-PMH, HTTP / HTTPS, FTP, file-system
- supports all common image and text formats (JPG, TIFF, TXT, PNG, PDF, DOC, HTML, XML, ...)
- performs basic validation, transformation and conversion to intermediate (exchange, working) formats
- adds technical and organizational metadata: source, time of harvest, producer, format, size, ...
- includes a scheduler for regular data collection from each source
- allows one-off, immediate collection
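The harvesting steps above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function names, the metadata field names and the SHA-256 checksum are assumptions. Only the request parameters `verb`, `metadataPrefix` and `from` are taken from the OAI-PMH protocol.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (incremental if from_date is given)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # only records changed since this date
    return base_url + "?" + urlencode(params)


def technical_metadata(source, payload, fmt, harvested_at=None):
    """Describe one harvested document: source, harvest time, format, size, checksum."""
    harvested_at = harvested_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "harvested_at": harvested_at.isoformat(),
        "format": fmt,
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # assumed extra field
    }


def is_due(last_run, interval, now):
    """Scheduler check: a source is due if it never ran or its interval has elapsed."""
    return last_run is None or now - last_run >= interval
```

A one-off, immediate collection would simply bypass `is_due` and harvest the source directly.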
Open Data Enhancer
This component extracts the maximum amount of relevant information (metadata) from each input document and evaluates its quality (completeness, accuracy, relevance, ...).
- for non-text documents, attempts to recognize the textual content (OCR subcomponent)
- classifies documents by type (content), for example contract, invoice, ...
- in text documents (including text obtained via OCR), recognizes the fundamental content attributes of a document according to its type, that is, "text mining" and "text recognition" (document data extraction)
- uses a spellchecker and defined dictionaries (registers, authority files) to complete or correct metadata
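The classification, dictionary-based correction and quality evaluation described above might look roughly like this. It is a sketch under stated assumptions: the keyword lists, function names and the 0.8 similarity cutoff are hypothetical, and keyword counting stands in for whatever classifier the real component would use. The fuzzy lookup relies on `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical keyword lists per document type; a real classifier would be trained or curated.
TYPE_KEYWORDS = {
    "contract": {"contract", "agreement", "parties"},
    "invoice": {"invoice", "vat", "due"},
}


def classify(text):
    """Return the document type whose keywords occur most often, or None if none match."""
    words = text.lower().split()
    scores = {t: sum(w in kws for w in words) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def correct_with_register(value, register, cutoff=0.8):
    """Replace a possibly misspelled value with its closest entry in an authority register."""
    matches = difflib.get_close_matches(value, register, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def completeness(metadata, required_fields):
    """Share of required fields that are present and non-empty (one simple quality measure)."""
    if not required_fields:
        return 1.0
    filled = sum(1 for f in required_fields if metadata.get(f))
    return filled / len(required_fields)
```

Completeness is only one of the quality dimensions listed above. Accuracy and relevance would need reference data or human review and are not covered by this sketch.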
Open Data Repository
Central repository for data, metadata (content, technical and operational) and indices (for full-text and faceted search). It consists of
- data storage (repository): stores the document itself or a direct link to it
- metadata storage (repository), e.g. in RDF format
- indexing server (support for full-text search, facets, sorting, filtering)
- image server: if needed, for storing and delivering large image data
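To make the role of the indexing server concrete, here is a toy in-memory index that supports full-text lookup, filtering, sorting and facet counts. The class `TinyIndex` and its methods are hypothetical and exist only for illustration. A production system would use a dedicated search engine instead.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """In-memory stand-in for an indexing server: full-text lookup plus facet counts."""

    def __init__(self):
        self.docs = {}                    # doc_id -> metadata dict
        self.postings = defaultdict(set)  # word -> set of doc_ids

    def add(self, doc_id, text, metadata):
        self.docs[doc_id] = metadata
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query, filters=None, sort_by=None):
        """Return ids of documents containing every query word and matching all filters."""
        hits = set(self.docs)
        for word in query.lower().split():
            hits &= self.postings.get(word, set())
        for field, value in (filters or {}).items():
            hits = {d for d in hits if self.docs[d].get(field) == value}
        if sort_by is None:
            return sorted(hits)
        # Assumes every hit carries the sort field; the id breaks ties.
        return sorted(hits, key=lambda d: (self.docs[d][sort_by], d))

    def facets(self, doc_ids, field):
        """Count how many of the given documents fall under each value of a metadata field."""
        return Counter(self.docs[d][field] for d in doc_ids if field in self.docs[d])
```

Facet counts are computed over the current result set, which is what lets a user interface show how many hits each filter value would leave.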
It also provides:
- allocation and management of unique, persistent document IDs to maintain identity, address duplication and facilitate internal and external links between documents
- management & monitoring, i.e. data management and maintenance at the system administrator level
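One possible way to combine persistent IDs with duplicate detection is to key the ID registry on a content digest. The sketch below assumes that approach. The class name `IdRegistry`, the `od:` prefix and the use of SHA-256 with random UUIDs are illustrative choices, not part of the described design.

```python
import hashlib
import uuid


class IdRegistry:
    """Assigns one persistent ID per distinct document content (hypothetical scheme)."""

    def __init__(self):
        self._by_digest = {}  # content digest -> persistent ID

    def register(self, content):
        """Return (persistent_id, is_new); identical content always maps to the same ID."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_digest:
            return self._by_digest[digest], False
        persistent_id = "od:" + uuid.uuid4().hex
        self._by_digest[digest] = persistent_id
        return persistent_id, True
```

This only catches byte-identical duplicates. Recognizing the same document in a different format or scan quality would require comparing metadata or extracted text.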
Open Data Services & API
Services and open interfaces that support third-party applications, mashups, etc., built on all the processed data in the system.
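As an illustration of what such an open interface could look like, the following sketch exposes documents as JSON through a small WSGI application. The URL layout (`/documents`, `/documents/<id>`), the record fields and the in-memory `DOCUMENTS` store are all assumptions made for the example.

```python
import json

# Hypothetical in-memory stand-in for the repository.
DOCUMENTS = {
    "od:1": {"title": "Road repair contract", "type": "contract"},
    "od:2": {"title": "Road repair invoice", "type": "invoice"},
}


def app(environ, start_response):
    """WSGI application: /documents lists IDs, /documents/<id> returns one record."""
    path = environ.get("PATH_INFO", "")
    prefix = "/documents/"
    if path == "/documents":
        status, body = "200 OK", {"ids": sorted(DOCUMENTS)}
    elif path.startswith(prefix) and path[len(prefix):] in DOCUMENTS:
        doc_id = path[len(prefix):]
        status, body = "200 OK", {"id": doc_id, **DOCUMENTS[doc_id]}
    else:
        status, body = "404 Not Found", {"error": "not found"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]
```

For local experiments the application can be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.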
Portal & Applications
A web user interface. Its main tasks are:
- provide basic information about Open Data activities
- make all processed documents available at a basic level:
  - find documents: browsing, full-text search, filtering, searching by metadata
  - view documents (and metadata) in a suitable view: list, detail, table (later map, timeline, graph, ...)
- serve as a community and collaboration tool for all stakeholders (data providers, application providers, application users, donors, the general public, ...)
- provide basic lists and statistics: sources and their quality, contributors, trends, most requested / most discussed documents, most active groups, ...
- allow feedback on documents and sources, including a mechanism for correcting and completing metadata (e.g. crowdsourcing)
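The lists and statistics mentioned above could be derived from simple aggregates. The sketch below assumes a view log (a list of document IDs, one entry per request) and documents that already carry a numeric `quality` score between 0 and 1. Both data shapes and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def most_requested(view_log, n=3):
    """Return the n most viewed document IDs with their view counts."""
    return Counter(view_log).most_common(n)


def source_quality(documents):
    """Average per-document quality score for each source."""
    scores = defaultdict(list)
    for doc in documents:
        scores[doc["source"]].append(doc["quality"])
    return {source: sum(values) / len(values) for source, values in scores.items()}
```

The same counting pattern would extend to contributors or discussion activity, given a log of the corresponding events.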