Tue, 14 Jun 2011 17:47:00 +0200
Context and Background
The aim was to design a simple, autonomous system that collects, manages, preserves, presents and provides documents and their metadata, all on a technology-neutral basis, following the principle of open architecture and using open standards for broad and easy usability.
The study is now being implemented as part of the FP7 research project COMSODE. Current information about Open Data Node can be found at http://opendatanode.org/ .
Big Picture
To better understand the issues and the structure of the problem, we outline the overall picture of the solution.
Component Model
Component name | Purpose | Note |
Data Harvester | Automatic and semi-automatic data updates | Various protocols and techniques: OAI-PMH, web download, offline files, user-upload |
Open Data Enhancer | Processing of input data: OCR, automatic metadata extraction, classification, contextualisation and enrichment | |
Open Data Repository | Centralized data storage, metadata and indices for open data | |
Open Data Services & API | Data services for third-party applications | |
Portal & Applications | Portal as an organizational and technical tool for the community | Basic functionality for searching and presenting open documents |
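The table can be read as a pipeline in which each component hands its output to the next one. The following is a minimal sketch of that data flow, assuming a document is a plain dictionary. All function names, field names and the sample URL are hypothetical and only illustrate the hand-over between components, not an actual implementation.

```python
Document = dict  # assumption: a document is a plain dict of fields

def harvest(source: str) -> Document:
    # Data Harvester: fetch raw content and record where it came from.
    # The content is hard-coded here instead of being downloaded.
    return {"source": source, "raw": "Contract no. 12/2011"}

def enhance(doc: Document) -> Document:
    # Open Data Enhancer: derive additional metadata from the raw content.
    return {**doc, "word_count": len(doc["raw"].split())}

def store(doc: Document, repository: list) -> Document:
    # Open Data Repository: persist the document and assign an ID.
    stored = {**doc, "id": len(repository) + 1}
    repository.append(stored)
    return stored

def run_pipeline(source: str, repository: list) -> Document:
    # Services, API and the portal would later read from `repository`.
    return store(enhance(harvest(source)), repository)

repository: list = []
result = run_pipeline("http://example.org/feed", repository)
```

Keeping each stage a function from document to document is one possible design choice; it lets stages be tested and replaced independently.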
Data Harvester
The main task of this component is to feed the system with raw data (documents) on a regular basis, adding new documents and updating existing ones.
- collects data from defined sources
- supports protocols such as OAI-PMH, HTTP / HTTPS, FTP, file-system
- supports all common image and text formats (JPG, TIFF, TXT, PNG, PDF, DOC, HTML, XML, ...)
- performs basic validation, transformation and conversion to intermediate (exchange, working) formats
- defines (adds) technical and organizational metadata: source, time of harvest, producer, format, size, ...
- includes a scheduler for (regular) data collection from each source
- also allows a one-time, immediate collection
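The scheduling and metadata points above can be sketched as follows. This is an illustration under stated assumptions, not the project's harvester: the `Source` class, the function names and the example URL are hypothetical. Only the request arguments `verb=ListRecords`, `metadataPrefix` and `from` come from the OAI-PMH protocol itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode

@dataclass
class Source:
    name: str
    base_url: str
    interval: timedelta                      # how often to harvest
    last_harvest: Optional[datetime] = None  # None means never harvested

def is_due(source: Source, now: datetime) -> bool:
    # A source is due if it was never harvested or its interval has elapsed.
    if source.last_harvest is None:
        return True
    return now - source.last_harvest >= source.interval

def oai_pmh_list_records_url(source: Source, metadata_prefix: str = "oai_dc") -> str:
    # Build an OAI-PMH ListRecords request for this source.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if source.last_harvest is not None:
        # Incremental harvest: only records changed since the last run.
        params["from"] = source.last_harvest.strftime("%Y-%m-%d")
    return f"{source.base_url}?{urlencode(params)}"

def technical_metadata(source: Source, content: bytes, fmt: str, now: datetime) -> dict:
    # Technical and organizational metadata attached to a harvested document.
    return {
        "source": source.name,
        "harvested_at": now.isoformat(),
        "format": fmt,
        "size": len(content),
    }

now = datetime(2011, 6, 14, 12, 0, tzinfo=timezone.utc)
src = Source("example", "http://example.org/oai", timedelta(days=1),
             last_harvest=now - timedelta(days=1))
url = oai_pmh_list_records_url(src)
meta = technical_metadata(src, b"hello", "TXT", now)
```

A one-time, immediate collection would simply skip the `is_due` check and harvest the source directly.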
Open Data Enhancer
This component extracts the maximum amount of relevant information (metadata) from an input document and evaluates its quality (completeness, accuracy, relevance, ...).
- for a non-text document, attempts to recognize the text part (OCR subcomponent)
- classifies the document by type (content), for example contract, invoice, ...
- in a text document (including text obtained via OCR), recognizes the fundamental content attributes of the document according to its type, i.e. "text mining" and "text recognition" (data extraction from documents)
- uses a "spellchecker" and defined dictionaries (registers, authority files) to complete or correct metadata
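The classification, quality evaluation and dictionary-based correction described above can be illustrated with a small sketch. The keyword rules, required fields and function names are hypothetical. Keyword counting stands in for whatever classifier is really used, and `difflib` fuzzy matching stands in for the "spellchecker".

```python
import difflib

# Hypothetical keyword rules per document type.
TYPE_KEYWORDS = {
    "contract": ["contract", "agreement", "parties"],
    "invoice": ["invoice", "amount due", "vat"],
}

# Hypothetical required metadata fields per document type.
REQUIRED_FIELDS = {
    "contract": ["supplier", "customer", "date", "price"],
    "invoice": ["supplier", "customer", "date", "amount"],
}

def classify(text: str) -> str:
    # Pick the type whose keywords occur most often; "unknown" if none match.
    lowered = text.lower()
    scores = {t: sum(lowered.count(k) for k in kws) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def completeness(metadata: dict, doc_type: str) -> float:
    # Share of required fields that are present and non-empty (0.0 to 1.0).
    required = REQUIRED_FIELDS.get(doc_type, [])
    if not required:
        return 0.0
    filled = sum(1 for field in required if metadata.get(field))
    return filled / len(required)

def correct_with_dictionary(value: str, dictionary: list) -> str:
    # Replace a possibly misspelled value with its closest dictionary entry,
    # or keep the value unchanged if nothing is similar enough.
    matches = difflib.get_close_matches(value, dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else value

doc_type = classify("This contract is an agreement between the parties.")
score = completeness({"supplier": "ACME", "date": "2011-06-14"}, doc_type)
city = correct_with_dictionary("Bratislva", ["Bratislava", "Nitra"])
```

The completeness score covers only one of the quality dimensions named above; accuracy and relevance would need reference data or human feedback.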
Open Data Repository
Central repository for data, metadata (content, technical and operational) and indices (for full-text and faceted search). It consists of:
- data storage (repository): stores the document itself or a direct link to it
- metadata storage (repository), e.g. in RDF format
- indexing server (support for full-text search, facets, sorting, filtering)
- image server, if needed, for storage and delivery of large image data
The repository also provides:
- allocation and management of unique, persistent document IDs to maintain identity, handle duplicates and facilitate internal and external links between documents
- management and monitoring, i.e. administration and maintenance of data at the system administrator level
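One way to combine persistent IDs with duplicate handling is to key the ID registry by a fingerprint of the content. The sketch below assumes an in-memory registry and a hypothetical `doc:` ID prefix. It detects only byte-identical duplicates, so near-duplicates (for example two scans of the same page) would need a different technique.

```python
import hashlib
import uuid

class IdRegistry:
    """In-memory registry mapping content fingerprints to persistent IDs."""

    def __init__(self) -> None:
        self._by_fingerprint: dict = {}

    @staticmethod
    def fingerprint(content: bytes) -> str:
        # SHA-256 of the raw bytes; identical content yields the same value.
        return hashlib.sha256(content).hexdigest()

    def register(self, content: bytes) -> tuple:
        # Return (document_id, is_new). A duplicate gets the existing ID back.
        fp = self.fingerprint(content)
        if fp in self._by_fingerprint:
            return self._by_fingerprint[fp], False
        doc_id = f"doc:{uuid.uuid4()}"  # hypothetical ID scheme
        self._by_fingerprint[fp] = doc_id
        return doc_id, True

registry = IdRegistry()
first_id, first_new = registry.register(b"contract text")
second_id, second_new = registry.register(b"contract text")
other_id, other_new = registry.register(b"invoice text")
```

Because the ID is generated once and then only looked up, links to a document stay valid when the same content is harvested again.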
Open Data Services & API
Services and open interfaces that support third-party applications, mashups, etc. built on top of all the data processed in the system.
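To make the idea of an open interface concrete, here is a sketch of the routing logic of a read-only JSON API. The URL layout (`/documents`, `/documents/<id>`, `?type=`), the sample data and the function name are assumptions made for illustration. The source does not specify the actual API.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical in-memory data standing in for the repository.
DOCUMENTS = {
    "1": {"id": "1", "type": "contract", "title": "Road maintenance"},
    "2": {"id": "2", "type": "invoice", "title": "Office supplies"},
}

def handle_get(url: str) -> tuple:
    # Route a GET request and return (HTTP status code, JSON body).
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parts == ["documents"]:
        # Optional ?type=... filter on the document type.
        wanted = parse_qs(parsed.query).get("type", [None])[0]
        docs = [d for d in DOCUMENTS.values() if wanted is None or d["type"] == wanted]
        return 200, json.dumps(docs)
    if len(parts) == 2 and parts[0] == "documents":
        doc = DOCUMENTS.get(parts[1])
        if doc is not None:
            return 200, json.dumps(doc)
    return 404, json.dumps({"error": "not found"})

status, body = handle_get("/documents?type=invoice")
```

The function is kept free of any web framework so that it can be wired into whichever HTTP server is used.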
Portal & Applications
A web user interface whose main tasks are to:
- provide basic information about the Open Data activity
- make all processed documents available at a basic level:
  - find documents: browsing, full-text search, filtering, searching by metadata
  - view documents (and metadata) in a suitable view: list, detail, table (later map, timeline, graph, ...)
- serve as a community and collaboration tool for all stakeholders (data providers, application providers, application users, donors, the general public, ...)
- provide basic lists and statistics: sources and their quality, contributors, trends, most requested / most discussed documents, most active groups, ...
- allow feedback on documents and sources, including a mechanism for correcting and completing metadata (e.g. crowdsourcing)
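Finding documents by full text, filters and metadata can be sketched with a few lines over in-memory records. The sample records and function names are hypothetical, and a real portal would delegate this work to the indexing server rather than scan a list.

```python
from collections import Counter

# Hypothetical sample records.
DOCS = [
    {"title": "Road maintenance contract", "type": "contract", "source": "city"},
    {"title": "Office supplies invoice", "type": "invoice", "source": "city"},
    {"title": "School catering contract", "type": "contract", "source": "region"},
]

def search(docs: list, query: str = "", **filters: str) -> list:
    # Keep documents whose title contains every query word and whose
    # metadata matches every given filter exactly.
    words = query.lower().split()
    return [
        d for d in docs
        if all(w in d["title"].lower() for w in words)
        and all(d.get(k) == v for k, v in filters.items())
    ]

def facet_counts(docs: list, field: str) -> dict:
    # Number of documents per value of `field`, shown next to the results.
    return dict(Counter(d[field] for d in docs if field in d))

contracts = search(DOCS, "contract")
city_contracts = search(DOCS, "contract", source="city")
by_type = facet_counts(DOCS, "type")
```

The facet counts are what lets a user narrow a result list step by step, for example from all documents to contracts from a single source.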
Links
Automatic extraction of metadata from published documents