DELIVERABLE D5.1
Methodology for publishing datasets as open data
Documentation of Practices
Project | Components Supporting the Open Data Exploitation |
Acronym | COMSODE |
Contract Number | FP7-ICT-611358 |
Start date of the project | 1st October 2013 |
Duration | 24 months, until 30th September 2015 |
Date of preparation | 30. 7. 2014 |
Author(s) | Martin Nečaský, Dušan Chlapek, Jakub Klímek, Jan Kučera, Andrea Maurino, Anisa Rula |
Responsible of the deliverable | Martin Nečaský |
Reviewed by |
Status of the Document | Final |
Version | 1.0 |
Dissemination level | PP Restricted to other programme participants (including the Commission Services) |
About this document
P01A01 Analysis of data sources
Development of the open data publication plan starts with analysing the data sources managed by the organization and identifying potential datasets contained in those sources. It is possible to skip this task when the organization has already decided that only particular datasets will be opened and it is clear where, by whom and how the datasets are managed in the organization. Otherwise, we recommend proceeding in the following steps:
- Analyse the organizational structure of the organization, its regulations and agendas. Identify activities of particular organizational units which are related to collecting, creating or managing data which constitute potential datasets. Record the identified datasets into the list of datasets.
- Analyse annual reports of the organization and other public documents (including the web portals of the organization) which inform about the activities and results achieved by the organization. Identify tables and graphs in the documents which point to potential datasets. Find out which organizational units prepared them. Record the newly identified datasets into the list of datasets.
- Identify information systems in the organization. Identify potential datasets managed by the systems. Record the identified datasets into the list of datasets.
- Analyse requests for information sent to the organization by the public. Try to identify datasets which are interesting for the public and compare them with the previously identified datasets. Record the newly identified datasets into the list of datasets.
- For each identified dataset, record the following information (a minimal sketch of such a record follows after this list):
○ title and description
○ responsible organization unit
○ contact person (for consultations about the dataset)
○ current form of the dataset (stored in a relational or other database system; stored as a tabular file or files, in XLS(X), ODS, etc.; XML format; CSV format; proprietary tabular format; only in a non-structured or semi-structured textual form) and a brief description of the form
- Create a map which shows organization units and datasets they are responsible for. The map should be graphically represented.
- Discuss the map and the list of identified datasets with relevant contact persons.
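For illustration, the list of identified datasets can be kept as a simple machine readable inventory. The following minimal Python sketch records the attributes listed above into a CSV file; the field names and example values are assumptions for illustration only, not something prescribed by this methodology:

import csv

# Hypothetical inventory of candidate datasets; one entry per identified dataset,
# with the attributes recommended above.
datasets = [
    {
        "title": "Inspections",
        "description": "Results of inspections performed by the authority",
        "responsible_unit": "Inspection Department",
        "contact_person": "jane.doe@example.org",
        "current_form": "relational database of the internal information system",
    },
]

with open("candidate-datasets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(datasets[0].keys()))
    writer.writeheader()        # one column per recorded attribute
    writer.writerows(datasets)  # one row per identified dataset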
P01A02 Identification of datasets for opening up
The map of available data sources, datasets and responsible organizational units developed in the previous task serves as an input into identification of suitable datasets for opening up. The goal of this task is to decide which of the candidate datasets should be published as open data and identify datasets that cannot be opened up.
We recommend discussing the datasets and the reasons for opening them up with the persons responsible for the datasets and other relevant stakeholders, for example in the form of a workshop. Alongside this internal discussion, engagement with potential users might provide information about their preferences. User engagement in this phase will help to establish a demand-oriented approach to open data publication, which in turn might help to mitigate the risk of absence of consumers (see the task CA03A01). User engagement during the development of the open data publication plan might also be a part of the activities of identification of the potential user groups (CA02A01).
For each of the datasets it is necessary to document the reasons for its publication as open data or the reasons why it cannot be published. If a dataset is not going to be published as open data, such a decision must be made after an analysis of the barriers preventing its publication.
Motivations for opening up data will probably differ from case to case; however, there are some existing studies that discuss common motivations for open data. For example, an international benchmark performed by Logica Business Consulting (2012) cites the following motivations:
- Increase transparency;
- Stimulate economic growth;
- Improve government services and responsiveness;
- Encourage reuse;
- Improve public relations and attitudes toward government;
- Improve government data and processes.
What datasets should be published as open data might also be determined, for example, by:
- strategic vision or strategic decision of the dataset owner and his or her objectives - strategic objectives might determine what the primary datasets or primary domains for open data are;
- legislation - legislation might determine what datasets should be published by public sector bodies for reuse; Directive 2013/37/EU amending Directive 2003/98/EC on the re-use of public sector information and the INSPIRE Directive 2007/2/EC are relevant directives that govern the reuse of public sector information and geodata;
- national and other strategies relevant to open data.
Datasets that are in line with the above-mentioned motivations for open data, whose publication helps to achieve the strategic goals of the data owner, or datasets that should be published as open data according to the legislation or some relevant strategy represent good candidates for open datasets. Budgetary data and data about purchases and public contracts might help to increase transparency. Data about actions and performance of public sector bodies might help to build trust and improve attitudes toward government.
In order to stimulate economic growth and to encourage reuse, high value datasets that are likely to be reused should be published. Transport data that can be reused in applications helping people to utilize public transport is an example of such data (Open Data Institute, 201?a).
According to (Logica Business Consulting, 2012) data that help citizens to better locate and utilize public services should be published in order to improve government services and responsiveness. Data describing the public services and providing information about who provides them and what are the conditions under which the services can be utilized are examples of data in this category. Another example might be life events data[1].
There are also studies that assess availability of certain datasets across different countries, e.g. the Open Data Index[2] or the Open Data Barometer[3]. Datasets assessed in these studies are usually in demand by the re-users. If you are a public sector body that owns or maintains any of the datasets named in these studies you might consider opening up such datasets.
If some dataset is considered too complex or too costly for opening up at the 3* or higher level (see the task P01A03), you should consider a lower target openness level (2*) which might require less effort.
Sometimes a dataset might already be published elsewhere, and it might even be a part of some larger dataset. For example, in the Czech Republic the Ministry of Finance publishes budgetary data for every public sector body in the Czech Republic as open data[4]. Public sector bodies in the Czech Republic might refer to the datasets of the Ministry of Finance instead of publishing their own datasets. Therefore every candidate dataset should be checked to make sure it does not duplicate an already published dataset.
If only a part of a dataset is published elsewhere you should analyse whether it is reasonable to split the dataset and publish just your primary data linked to the external dataset. The necessary links might be provided in the catalogue record or if the Linked Data principles are applied (5* open data) it is possible to interlink objects in the datasets directly.
P01A03 Determination of target level of openness
- Determine the target level of openness according to the 1*-5* scheme for each dataset. The minimal level is 3*.
○ It is possible to determine the 2* level in special cases when the dataset exists only in the form of unstructured documents and it is not possible for the organization to convert them to a structured form.
- Choose one or more data formats for publishing each dataset on the basis of the determined level of openness:
○ For 2* level, we recommend spreadsheet formats (ODS, XLS(X), etc.) or HTML. In the case of textual documents (e.g. public contract agreements), it is possible to choose text document formats (ODT, DOC(X), etc.).
○ For 3* level, we recommend CSV, XML or JSON.
○ For 4* level, it is possible to choose RDF, which is also the recommended format for 5* level. It is also possible to choose CSV, XML or JSON. The key characteristic of 4* level in comparison to 3* level is that entities in the dataset are identified by URLs, so that it is possible to transparently reference them from other datasets. P02A03-04 and P02A03-05 provide recommendations for designing the URLs. We recommend recording the URLs in CSV, XML and JSON formats in the following way (a short sketch follows after this list):
■ For CSV, we recommend adding a new column for entity URLs. The column should be placed beside the already existing column for entity identifiers[5].
■ For XML, we recommend recording URLs using RDFa[6], an extension of HTML and XML documents (the resource XML attribute).
■ For JSON, we recommend recording URLs using JSON-LD[7], an extension of JSON (the @id construct).
○ For 5* level, i.e. linked open data represented in the RDF model, we recommend TTL[8]. The key characteristic of 5* level in comparison to 4* level is that we not only provide entity identifiers in the form of URLs but also link them to URLs of other related entities in other datasets. It is also necessary to ensure that each URL of our published entities is dereferenceable, i.e. that a client application receives a machine readable representation of the entity in the RDF model when it accesses the URL.
- Determine the target periodicity for publishing updates of the dataset.
○ Updates do not have to be published as often as changes to the source data appear. For example, even though the source data change every hour, we can determine that updates will be published weekly. In other words, the target periodicity is one week. The closer the target periodicity is to the frequency of changes, the higher the quality of the dataset. On the other hand, more frequent publication of updates can result in higher costs.
○ It is possible to reflect the feedback from the public (see task CA02A08) to identify expected target periodicity or to verify that the determined periodicity is sufficient for the potential users of the dataset.
- Decide how the dataset will be published. Basically, a data file should be published for each chosen data format. It should contain the items which are valid at the time of publication of the data file.
○ If the dataset is too big, it can be split into multiple data files.
- Decide if historical versions of the dataset will be published. If so, decide how many versions back to the history will be published.
- If it is technically possible, it is recommended to publish the list of changes made to the dataset between the current and the previous version of the dataset.
- If it is technically and financially possible it is recommended to publish not only data files but also an API which enables direct access to particular items (entities) of the dataset:
○ For 3* and 4*, an API should be a REST service.
○ For 5*, an API should be a SPARQL endpoint.
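As an illustration of the 4* recommendations above, the following minimal Python sketch (standard library only) records entity URLs in a new CSV column and as the JSON-LD @id construct. The base URL, identifiers, column names and file names are only examples following the conventions of P02A02-03:

import csv, json

BASE_URL = "http://data.coi.cz/resource"  # entity URL base, see P02A02-03

rows = [{"inspection_id": "101201020952801", "region": "CZ010"}]

# CSV: a new column with the entity URL placed next to the existing identifier column.
with open("inspections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["inspection_id", "inspection_url", "region"])
    writer.writeheader()
    for row in rows:
        row["inspection_url"] = f"{BASE_URL}/check-action/{row['inspection_id']}"
        writer.writerow(row)

# JSON-LD: the @id construct carries the entity URL.
records = [
    {"@id": f"{BASE_URL}/check-action/{r['inspection_id']}",
     "inspection_id": r["inspection_id"],
     "region": r["region"]}
    for r in rows
]
with open("inspections.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)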
P01A04 Effort estimation
Publication of datasets takes effort. If you look at the tasks in phases 2-4 of the open data publication process described in this methodology, you can see that there are a number of tasks that must be done. In order to be able to assess the costs of publication of a particular dataset, the effort required for its publication needs to be estimated.
Because there might be significant differences between datasets, we recommend making an effort estimation for each individual candidate dataset. This approach also allows comparison of the required effort between datasets.
According to (Kučera, 2014) effort required for publication of datasets might be affected by the following factors:
- dataset complexity,
- anonymization,
- manual operations performed with the dataset,
- dataset size,
- target periodicity of publication.
In this context the dataset complexity expresses how difficult it would be to transform the dataset from its current format into the target format. The fewer changes to the dataset that are required to achieve the target format, the less effort it will take to publish the dataset as open data.
If it is necessary to anonymise the dataset before it is published, or if some manual operations are required, the effort will probably increase. In such a case the amount of effort might also be affected by the dataset size. Even if the anonymisation is automated, it might be necessary to validate the results of the anonymisation in order to mitigate the risk of privacy infringement (see the task CA03A01 Identification and analysis of the potential risks). The effort required to perform such validation increases with the size of the dataset in case it involves manual operations.
Target periodicity of publication affects the total estimated effort for a given period of time. The target periodicity and the length of the period determine how many times the tasks required to prepare and publish a dataset will be repeated. The estimated effort of these tasks is therefore multiplied by the number of times they are repeated during the given period. However, in some situations the estimated effort for some tasks might decrease over time, or it might be lower during maintenance of a dataset than during the initial publication, because an increase in the efficiency of the operations might be expected. In order to reflect these efficiency improvements, coefficients can be added into the calculation (a short illustrative calculation is sketched at the end of this section).
Estimated effort might be expressed as an estimated number of person-hours or person-months. However, it is also possible to just rate the dataset publication effort using a given scale. For example, Eibl et al. (2013) propose the following scale for rating the estimated dataset publication effort (costs):
- unjustifiable cost,
- very high cost,
- high cost,
- medium cost,
- low cost,
- very low cost.
We can sum up that the estimation of effort should be performed in the following steps:
- Determine the period for which the effort is going to be estimated, e.g. 1 year.
- Estimate the effort of preparation of publication of a dataset - effort required to perform tasks not directly related to one particular dataset but to a set of datasets, like selection and implementation of the software tools, might be distributed among the datasets that should be opened up.
- Estimate the effort of realization of publication - target periodicity of publication affects the results of this step.
- Estimate the effort of termination of maintenance/publication of a dataset - mainly the effort required to make the transition into the dataset archive or to make the dataset no longer available should be estimated. After the maintenance or publication of the dataset is terminated the dataset should consume very little or no effort.
- Estimate the effort of the cross-cutting activities. Respective portions of effort might be distributed among the datasets that should be opened up if necessary.
If the total amount of effort for the whole open data initiative is calculated, effort of the preparation of publication phase should be added to the total estimated effort of publication of all selected datasets.
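The following minimal Python sketch illustrates such a calculation for a one-year period. All numbers, dataset names and efficiency coefficients are purely illustrative assumptions, not values prescribed by this methodology:

# Illustrative effort estimation (person-hours) for a 1-year period.
preparation_shared = 120      # preparation tasks shared by all datasets (tool selection etc.)
cross_cutting_shared = 60     # cross-cutting activities distributed among the datasets

datasets = [
    # (name, effort per release, releases per year, efficiency coefficient, termination effort)
    ("inspections", 16, 12, 0.8, 4),   # monthly updates, routine assumed to get 20 % cheaper
    ("sanctions",   24,  4, 1.0, 4),   # quarterly updates, no efficiency gain assumed
]

shared_per_dataset = (preparation_shared + cross_cutting_shared) / len(datasets)

total = 0.0
for name, per_release, releases, coefficient, termination in datasets:
    publication = per_release * releases * coefficient   # repeated publication tasks
    dataset_total = shared_per_dataset + publication + termination
    print(f"{name}: {dataset_total:.1f} person-hours")
    total += dataset_total

print(f"total estimated effort: {total:.1f} person-hours")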
P01A05 Definition of the open data publication plan
The open data publication plan is the final output of the first (analytical) phase of the open data publication process. The open data publication goals are documented in this plan; they should be formulated in line with the strategic goals of the data owner and should answer the question “why should I publish open data?” Some of the possible motivations for opening up data are listed in the description of practices of the task (P01A02) Identification of datasets for opening up. However, the list of motivations in that section is not comprehensive.
In the case of public sector bodies, a national open data policy might affect the development of the open data publication plan and the formulation of the open data publication goals, because these policies might determine what datasets should be published as open data. The US Open Data Policy (Executive Office of the President, 2013), the National Information Infrastructure in the UK (Cabinet Office, 2013) or the Open Government Data Germany study (Klessmann et al., 2012) are examples of national policies and studies related to open data. The international Open Government Partnership initiative might also be relevant when open data publication goals are formulated, because the national action plans might also set priorities for open data and datasets for publication (see for example the action plan of the Czech Republic (Czech Republic, 2012) or the Slovak Republic (Slovak Republic, 2012)).
Information about the potential open datasets acquired during the previous tasks is compiled into the final catalogue of candidate datasets for opening up. Prioritization of the datasets should be performed and the open data publication roadmap should be set according to the set priorities. Finally, the open data publication plan should describe the roles of the stakeholders involved in the publication of open data; the responsibilities of these roles should be set and documented in the plan.
Definition of the open data publication plan should be coordinated with the definition of the benefits management plan (CA04A02) as well as the risk mitigation plan (CA03A02). Benefits, risks and publication effort should be balanced. In order to balance these factors it is necessary to analyse benefits, risks and effort as well. Therefore these plans should not be defined separately from each other; their definition should be aligned and harmonized.
Unless the (AR35) Communication strategy, (AR43) Risk mitigation plan and the (AR48) Benefits management plan are developed and documented separately they can be included in the open data publication plan.
P01A05-01 Prioritization of datasets
If it is not possible to publish all the candidate datasets at once, publication of the candidate datasets should be spread over several releases according to the set priorities. According to (Kučera, 2012), prioritization of datasets for opening up might involve the following steps:
- Determine what key dataset attributes will be used as prioritization criteria.
- Determine weights of the prioritization criteria.
- Remove datasets that cannot be published as open data from the catalogue of candidate datasets.
- Calculate priorities based on the prioritization criteria and their weights. Sort the candidate datasets by the calculated priorities (a sketch of this calculation follows after this list of steps).
- Move datasets up or down the list if you feel that the calculated scores do not reflect the open data publication goals or the true value of the datasets. Alternatively, you can adjust the weights of the prioritization criteria and recalculate the scores.
- Make the final selection of the datasets for opening up and distribute the selected datasets to the planned releases.
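A minimal sketch of the weighted scoring described in the steps above might look as follows; the criteria, weights and ratings are purely illustrative and would have to be adapted to the organization:

# Illustrative prioritization: each criterion is rated 1 (worst) to 5 (best) per dataset.
weights = {"demand": 0.4, "current_format": 0.2, "benefits": 0.3, "estimated_effort": 0.1}

candidates = {
    "inspections": {"demand": 5, "current_format": 4, "benefits": 5, "estimated_effort": 3},
    "sanctions":   {"demand": 3, "current_format": 2, "benefits": 4, "estimated_effort": 4},
}

scores = {
    name: sum(weights[criterion] * rating for criterion, rating in ratings.items())
    for name, ratings in candidates.items()
}

# Sort the candidate datasets by the calculated priority, highest first.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")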
Kučera (2012) suggests the following prioritization criteria (the list is not comprehensive):
- Demand for data - demand can be determined for example by a survey or by analysis of FOI requests (frequent FOI requests on a certain topic might indicate demand for certain datasets).
- Current formats - publication of datasets that are already available in machine readable formats is probably less costly than publication of datasets that require non-trivial transformation of data. Therefore, selecting already machine readable datasets might allow the open data initiative to start quickly.
- Availability of the schema documentation - documentation of the datasets schema might make the reuse easier.
- Benefits - datasets with higher expected benefits should be preferred.
- Risks - datasets that are less risky to publish should be preferred.
- Estimated effort - datasets that are easy to publish might be preferred, especially if they have a high value.
If the dataset should be published as Linked Data the following criteria might be applied for prioritization (Nečaský et al., 2014):
- Identifiers/keys - datasets with natural keys are the best candidates for linked datasets, because identification of such an entity does not require an arbitrary agreement between parties about how to identify it.
- Linking potential - an estimate of the number of datasets that can be linked to the assessed dataset.
P01A05-02 Development of the open data publication roadmap
According to the set priorities of the candidate datasets, a schedule for their publication should be determined. However, the roadmap should not contain only the planned releases of datasets; it should also contain the schedule of other tasks that must be performed in order to be able to publish the datasets, i.e. the tasks of the preparation of publication phase, such as P02A04.
P02A01 Data sources access configuration
This methodology recommends automating the process of publication of the identified datasets. Any software tool which enables the automation needs access to the sources of the datasets in the organization. Such sources are usually the database servers underlying the identified information systems, or simply (tabular) files stored somewhere in the file system of the organization. It is therefore necessary to ensure that the tool or tools used for the automation can access those sources. Since the tool ensures publication, read-only access is sufficient and recommended for security reasons.
- In the case of a database server as a data source, it is necessary to create a read-only user account which enables the software tool to read the required data. Another option is to prepare database scripts which periodically dump the tables to machine readable data files (CSV, XML or JSON); a sketch of such a dump follows after this list.
- In the case of data files as a data source (including data files created directly by people in the organization or exported from a database), it is necessary to copy the data files to a place in a file system where the software tool can read them locally. It is also possible to make the data files readable remotely through FTP or other network protocols.
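The following minimal Python sketch illustrates the second option for database sources, a periodic dump of one table to a machine readable CSV file. It uses sqlite3 only to keep the example self-contained; in practice the connection would point to the organization's database server and use the read-only account, and the database and table names are hypothetical:

import csv
import sqlite3

# Illustrative periodic dump of one table to a machine readable CSV file.
# "internal-system.db" and the "inspections" table are hypothetical names.
connection = sqlite3.connect("internal-system.db")
cursor = connection.execute("SELECT * FROM inspections")

with open("inspections-dump.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([column[0] for column in cursor.description])  # header row with column names
    writer.writerows(cursor)                                       # one CSV row per table row

connection.close()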
P02A02 Definition of the catalogue record schema and the target data catalogues
P02A02-01 Definition of the catalogue record schema
- It is recommended to define the catalogue record schema according to the W3C vocabulary DCAT[9].
- In case of 5* data, it is recommended to extend the catalogue record schema with the W3C vocabulary VoID[10].
- If it is necessary to record metadata about the dataset provenance, it is recommended to extend the catalogue record schema with the W3C PROV Data Model[11]. In particular, it is recommended to use the PROV Ontology (PROV-O)[12]. It makes it possible to describe the process of data creation, including information about who is responsible for particular steps of the creation process and the inputs/outputs of those steps.
- The organization should maintain its catalogue records in its internal data catalogue, which offers a user interface for editing catalogue records structured according to DCAT (and VoID where applicable) and allows exporting them as Linked Open Data (i.e. as 5* data) expressed in the TTL and JSON-LD formats (a sketch follows below).
○ TTL format is a commonly used format for publishing Linked Open Data.
○ JSON-LD is also one of the formats for publishing Linked Open Data. It makes it possible to process the catalogue records as JSON documents. It is useful for programmers who are able to process JSON but do not know how to work with Linked Open Data.
- As a part of the catalogue records it is also necessary to record metadata for the particular data files which form the content of a dataset. These are called distributions. DCAT and VoID provide constructs for the specification of this kind of metadata.
- It is also recommended to use DCAT to describe metadata about the public local data catalogue of the organization (see P02A02-02).
When DCAT is applied, the catalogue records (but not the data files themselves) are prepared for publication as 5* open data. In other words, catalogue records are prepared for publication as Linked Open Data.
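As an illustration, a catalogue record structured according to DCAT can be produced, for example, with the Python rdflib library. The URLs follow the conventions recommended in P02A02-02 and P02A02-03, while the metadata values and the download URL are only examples:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

DCAT = Namespace("http://www.w3.org/ns/dcat#")

catalog = URIRef("http://data.coi.cz/resource/catalog")
dataset = URIRef("http://data.coi.cz/resource/dataset/inspections")
distribution = URIRef("http://data.coi.cz/resource/dataset/inspections/2014Q1-CSV")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

# Catalogue, dataset and distribution described with DCAT classes and properties.
g.add((catalog, RDF.type, DCAT.Catalog))
g.add((catalog, DCAT.dataset, dataset))

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Inspections", lang="en")))
g.add((dataset, DCTERMS.modified, Literal("2014-07-30", datatype=XSD.date)))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.mediaType, URIRef("http://www.iana.org/assignments/media-types/text/csv")))
g.add((distribution, DCAT.downloadURL, URIRef("http://data.coi.cz/dumps/inspections-2014Q1.csv")))  # example URL

g.serialize(destination="metadata-dump.ttl", format="turtle")
# JSON-LD output needs rdflib 6+ (or the rdflib-jsonld plugin for older versions).
g.serialize(destination="metadata-dump.json", format="json-ld")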
The correct application of DCAT and VoID vocabularies is a relatively complex problem. It is recommended to proceed according to detailed specifications of those vocabularies published by W3C:
- DCAT - latest version of the vocabulary and its usage guide is available at http://www.w3.org/TR/vocab-dcat/
- VoID - latest version of the vocabulary and its usage guide is available at http://www.w3.org/TR/void/
It is also necessary to design URLs of each catalogue record and other related entities. Those URLs are used for the identification of those entities according to the Linked Open Data principles. Chapter P02A02-03 is dedicated to design patterns for the URLs.
Last but not least, let us note that the EU ISA initiative (Interoperability Solutions for European Public Administrations) created a DCAT profile which specifies how European datasets should be published: the DCAT application profile for data portals in Europe (DCAT-AP). DCAT-AP extends the DCAT vocabulary and distinguishes mandatory and optional classes and properties. It recommends vocabularies for the specification of licence types. The recommendations in this Methodology are based on DCAT. However, they are applicable even when DCAT-AP is applied.
P02A02-02 Recommendations for local public data catalogue
The organization can decide whether it will run its own public data catalogue on its web portal. If there is a national central data catalogue in the country of the organization, the organization should use it to publish catalogue records about its datasets. If no central catalogue exists, the organization should run its own. We will call it the local public data catalogue of the organization. This chapter contains recommendations for running the local public data catalogue of the organization (referred to as the catalogue in the rest of this chapter). It is recommended to choose one of the following forms of the catalogue:
- A primitive data catalogue is a simple HTML page which informs the public that the organization publishes its data as open data. It must also allow downloading a data file with the exported catalogue records in a machine readable notation (see P02A02-01).
- A basic data catalogue extends the simple HTML page of the primitive data catalogue with a list of links to HTML pages dedicated to the particular datasets published by the organization. Each HTML page describes, in a human readable form, the metadata from the catalogue record of the dataset. It provides links for downloading the individual distributions (or rather the data files forming the distributions) of the dataset. It also allows downloading the catalogue record of the dataset in a machine readable notation.
- A full data catalogue extends the basic data catalogue with search and other functions (e.g., previews of dataset distributions, user discussions, etc.).
It is recommended that the homepage of the catalogue has a URL based on the following pattern:
http://data.{domain2}.{domain1}
where {domain1} is the top-level domain of the organization and {domain2} is the second-level domain of the organization. We will call this URL the catalogue URL and denote it {catalogue-URL}. E.g., the URL of the homepage of the catalogue of the Czech Trade Inspection Authority would be
http://data.coi.cz
It is also recommended that the URL from which the data file with all catalogue records in the machine readable TTL (or JSON, respectively) format can be downloaded is based on the following pattern:
{catalogue-URL}/metadata-dump.ttl
resp.
{catalogue-URL}/metadata-dump.json
For the catalogue of Czech Trade Inspection Authority, the URLs would be:
http://data.coi.cz/metadata-dump.ttl
resp.
http://data.coi.cz/metadata-dump.json
URLs for other HTML pages in the catalogue, e.g. HTML pages for individual datasets, are not prescribed by this methodology.
P02A02-03 Convention for URLs of catalogue entities
The recommendation P02A02-01 specifies that the catalogue record schema should be based on DCAT. DCAT applies the Linked Open Data principles. Therefore, the catalogue of the organization, the datasets recorded in the catalogue and their particular distributions are understood as entities which must be identified by their URLs. According to the Linked Open Data principles, entity URLs are important for the following reasons:
- An entity (catalogue, dataset, distribution) is uniquely identified by its URL.
- The URL of the catalogue is used to specify metadata about the catalogue in a machine readable representation. The URLs of the datasets and their distributions are used to specify catalogue records in a machine readable notation.
- Other people and systems can publish statements about an entity (i.e. about the catalogue, a dataset in the catalogue or its distribution). They use the URL of the entity to specify the statements in their published data.
- In the case of a full implementation of the Linked Open Data principles, a client application can resolve the URL of an entity (i.e. the catalogue, a dataset in the catalogue or its distribution) using the HTTP protocol. The server returns a machine interpretable representation of the entity. We call such URLs dereferenceable URLs.
○ In case of URL of the catalogue, the server returns metadata about the catalogue.
○ In case of URL of a dataset, the server returns the catalogue record for the dataset including metadata about dataset distributions.
○ In case of URL of a distribution, the server returns the part of the corresponding catalogue record related to the distribution.
Ensuring dereferenceable URLs can be technically complicated for some organizations. However, if it is ensured, the organization publishes its catalogue metadata and catalogue records as 5* open data.
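The following minimal sketch (using the Python requests library; the entity URL is only an example) shows how a client resolves a dereferenceable URL and asks for a machine readable RDF representation via HTTP content negotiation:

import requests

entity_url = "http://data.coi.cz/resource/dataset/inspections"  # example dataset URL

# Ask the server for an RDF (Turtle) representation of the entity instead of an HTML page.
response = requests.get(entity_url, headers={"Accept": "text/turtle"}, timeout=30)
response.raise_for_status()

print(response.headers.get("Content-Type"))  # expected: text/turtle (or another RDF media type)
print(response.text[:500])                   # beginning of the returned catalogue record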
The organization can choose any convention for entity URLs. However, it is necessary to ensure that the URLs of entities are different from the URLs of the web pages which present information about those entities to human end users. For example, it is necessary to ensure that the URL of the catalogue homepage is different from the URL of the catalogue itself. Similarly, the URL of a web page about a particular dataset must be different from the URL of the dataset.
If the organization does not want to create its own convention for entity URLs, we recommend the following convention. As the base of the URLs of all entities, we recommend the pattern
{catalogue-URL}/resource
where {catalogue-URL} is the catalogue URL (see P02A02-02). We call this URL the entity URL base and we will denote it {base-URL}. E.g., for the Czech Trade Inspection Authority the entity URL base would be
http://data.coi.cz/resource
The recommended pattern for the URL of the catalogue is
{base-URL}/catalog
E.g., for Czech Trade Inspection Authority the URL of the catalogue would be
http://data.coi.cz/resource/catalog
The recommended pattern for the URL of a dataset is
{base-URL}/dataset/{dataset-id}
where {dataset-id} is any string which is syntactically correct in the URL and which uniquely identifies the dataset. It is recommended to create it on the base of the dataset name. E.g., a dataset with inspections made by Czech Trade Inspection Authority would have the URL
http://data.coi.cz/resource/dataset/inspections
The recommended pattern for the URL of a distribution of a dataset is
{base-URL}/dataset/{dataset-id}/{distribution-id}
where {distribution-id} is any string which is syntactically correct in a URL and which uniquely identifies the distribution in the scope of the respective dataset. For example, the URL of a distribution of the dataset with inspections made by the Czech Trade Inspection Authority in the 1st quarter of 2014, represented in the CSV format, would be
http://data.coi.cz/resource/dataset/inspections/2014Q1-CSV
P02A02-04 Choice of target data catalogues
- It is recommended that the organization runs its own data catalogue with catalogue records of its own datasets.
○ The catalogue should be made according to the recommendations in P02A02-02.
- If there is a national central data catalogue in the country of the organization, it is necessary to publish catalogue records in this catalogue as well.
○ If the central catalogue supports DCAT (and VoID) and automated import of catalogue records, we recommend to exploit this functionality for automated copying of the catalogue records from the data catalogue of the organization to the central one.
○ If the central catalogue does not support DCAT (and VoID) or automated import of catalogue records, we recommend that the organization requests this functionality from the administrator of the central catalogue.
- If there is no national central data catalogue, the organization should run its own data catalogue. Otherwise, running its own catalogue is only an optional recommendation.
P02A03 Description of the datasets
P02A03-01 Preparation of catalogue records
- It is necessary to prepare catalogue records with respect to the designed catalogue record schema (see P02A02-01) for each dataset in the publication plan of the organization.
- Catalogue records should be prepared in the internal data catalogue of the organization.
- Preparation of a catalogue record means filling in the values of those attributes which are known at design time. It is not necessary to fill in attributes which will only be known at publication time and will be filled in automatically if the publication process is automated by a software tool (see P02A06).
P02A03-02 Designing data schemas for datasets published as 3* or 4* data
A data schema describes the required structure and partly also the semantics of dataset items. In the case of 3* and 4* data, a data schema for a dataset is optional but recommended. Its existence makes work with the dataset easier for application developers and other users. It also lowers the risk of misinterpretation of the dataset. The way of publishing the data schema depends on the data format chosen for publishing the dataset. We recommend the following:
- On the basis of the chosen data format, determine the language for expressing the data schemas.
○ For CSV format, choose Metadata Vocabulary for Tabular Data[13].
○ For XML format, choose DTD[14] or XML Schema[15].
○ For JSON format, choose JSON Schema[16].
- Create a data schema for each dataset and each format in which you publish the dataset. Try to explain as much semantics as possible in the form of descriptions and commentaries. Use the proper constructs of the chosen schema language to specify the descriptions and commentaries (a sketch of such a schema follows after this list).
- Independently of the chosen schema language, specify the data types of primitive values using primitive types of the XML Schema language.
- Publish each data schema as a separate file and link that file from the data files with the distributions of the dataset. For linking, use appropriate constructions of the particular language.
- If there is a local data catalogue of the organization and it provides a web page for each dataset, put the link to the data schema for the dataset on the web page.
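As an illustration for the JSON case, the following sketch defines a small JSON Schema with descriptions and validates one record against it using the third-party jsonschema package; the attribute names and the schema itself are only assumptions made for the example:

import jsonschema  # third-party package implementing JSON Schema validation

# Illustrative JSON Schema for one dataset item; "description" carries the semantics,
# and the primitive types mirror XML Schema types (string, date, ...).
schema = {
    "type": "object",
    "properties": {
        "inspection_id": {"type": "string", "description": "Identifier of the inspection"},
        "date": {"type": "string", "format": "date", "description": "Date of the inspection (xsd:date)"},
        "region": {"type": "string", "description": "NUTS code of the region"},
    },
    "required": ["inspection_id", "date"],
}

record = {"inspection_id": "101201020952801", "date": "2014-03-12", "region": "CZ010"}
jsonschema.validate(instance=record, schema=schema)  # raises ValidationError if the record does not conform
print("record conforms to the schema")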
P02A03-03 Designing vocabularies, ontologies, code lists and taxonomies for datasets published as 5* data
The recommendations presented in this chapter are relevant only for datasets published as 5* data, i.e. Linked Open Data. The recommendations are divided into those related to designing vocabularies and ontologies and those related to codelists and taxonomies.
P02A03-03-01 Designing vocabularies and ontologies
A data schema of a dataset published as 5* data is specified as a so-called vocabulary or ontology. These terms are often used interchangeably even though there are significant differences. However, for simplicity we will use them as synonyms and use only the term ontology.
An ontology describes an object model of a dataset - it describes the classes of entities in the dataset, their attributes and the relationships among them. In ontology terminology, attributes and relationships are both called predicates. An ontology is expressed as 5* data in the RDF model. The ontology which describes how ontologies should be expressed in RDF is RDF Schema[17]. It provides basic classes and predicates which enable anyone to define new ontologies, i.e. their classes and properties. There is also the OWL[18] language, which extends RDF Schema with additional constructs with better expressive power.
When designing an ontology to describe a dataset, we recommend proceeding as follows:
- Depict a conceptual schema of the dataset in a diagram. (It is possible to depict it as a UML class diagram or an ER diagram. These kinds of diagrams are commonly used data engineering techniques, and there are many free as well as commercial tools for drawing them.) It is not necessary to design a sound and complete conceptual schema. The aim is to summarize the classes of entities in the dataset, their attributes and the relationships between them. A graphical diagram is a good way to do this. It can also serve as a means of communication with the persons in the organization responsible for the dataset.
- Existing ontologies should be identified and reused for the classes, attributes and relationships in the conceptual diagram. Existing ontologies which are commonly used in the world of Linked Open Data should be preferred.
- We create a new ontology only for those classes, attributes and relationships in the conceptual diagram which were not covered by any existing ontology. The new ontology describes only those new concepts. The concepts which were reused from the existing ontologies are not described in the new ontology.
○ It is not necessary to follow all known best practices when the new ontology is designed for the dataset. It is not the goal to develop an ontology which will be reusable by other organizations for their datasets. The goal is to develop a basic ontology so that the dataset can be published quickly. It can be refined and improved later.
○ Therefore, we recommend to create the new classes and predicates in the new ontology as direct equivalents of the classes, attributes and relationships described in the conceptual schema. Each new class and predicate must have its URL. We recommend patterns for those URLs below.
- Ontologies (the reused ones and the newly created one) used for the dataset representation should be recorded in the catalogue record. For this, VoID predicate vocabulary should be used (http://rdfs.org/ns/void#vocabulary).
We recommend searching for existing reusable vocabularies in various ontology catalogues. We recommend the following catalogues:
- Linked Open Vocabularies [http://lov.okfn.org/dataset/lov/] - Catalogue of recommended ontologies with classification of the ontologies into domains of usage.
- Sindice [http://sindice.com], Swoogle [http://swoogle.umbc.edu/], Watson [http://watson.kmi.open.ac.uk/WatsonWUI/] - Search engines in the Linked Open Data space which can also be used to search for ontologies used in the indexed datasets.
- BioPortal [http://bioportal.bioontology.org/] - A catalogue of ontologies from the domain of life sciences.
- Ontologies maintained by W3C [http://www.w3.org/standards/semanticweb/ontology]
- Ontologies produced by EU ISA initiative [https://joinup.ec.europa.eu/community/semic/og_page/studies#core-vocabularies]
There are also specific ontologies which are highly reused by many datasets published as Linked Open Data. We recommend reusing them because they make the datasets much more reusable.
General ontologies
- DCMI Metadata Terms [http://purl.org/dc/terms/] - provides predicates for basic properties of entities, such as title, description, creation date, author, language, etc.
- Simple Knowledge Organization System (SKOS) [http://www.w3.org/2004/02/skos/core#] - describes codelists of terms and taxonomies of terms with various kinds of relationships among the terms (e.g., it can express broader and narrower terms for a given term, or terms related to another term)
- RDF Schema [http://www.w3.org/2000/01/rdf-schema#] and OWL [http://www.w3.org/2002/07/owl#] - ontologies for describing ontologies; they also provide constructs for expressing basic semantic relationships between entities (e.g., a relationship which specifies that two entities are equivalent)
- Data Cube Vocabulary [http://www.w3.org/TR/vocab-data-cube/] - represents statistical data cubes; it is based on the SDMX standard
- Functional Requirements for Bibliographic Records (FRBR) [http://purl.org/vocab/frbr/core#] - represents documents and publications of different kinds
- Open Annotation Core Data Model [http://www.openannotation.org/spec/core/] - represents annotations of documents and publications (i.e. relationships of parts of the documents to related entities published as Linked Open Data)
Domain specific ontologies
- GoodRelations [http://purl.org/goodrelations/v1#] - describes products and offerings
- RegOrg [http://www.w3.org/TR/vocab-regorg/] - describes organizations registered in national registries of organizations
- PublicContractsOntology [http://purl.org/procurement/public-contracts#] - describes public contracts
- LEX Ontology [http://purl.org/lex#] - describes legal documents (acts, court decisions, etc.)
- Friend-of-a-Friend (FOAF) [http://xmlns.com/foaf/spec/] - describes people and the relationships between them
Multi-domain ontologies
- schema.org [http://schema.org] - a set of ontologies which cover different aspects and domains.
Ontologies themselves are expressed in the RDF model as 5* data. Therefore, the classes and predicates defined by an ontology are considered entities, each with its own URL. Even the ontology itself is an entity and must have a URL. The following pattern is recommended for constructing the URL of an ontology
{catalogue-URL}/ontology/{ontology-name}
where {catalogue-URL} is the catalogue URL (see P02A02-02) and {ontology-name} should be based on the name of the dataset for which the ontology has been constructed and should be self-explanatory. It must uniquely identify the ontology in the set of all ontologies created by the organization, and it should be written in the CamelCase notation (e.g. InspectionResults instead of inspectionResults or inspection-results). We will call this URL the ontology URL and denote it {ontology-URL}.
The URL of a class or predicate in the new ontology should be created according to the pattern
{ontology-URL}/{class-or-predicate-name}
where {class-or-predicate-name} is a string which uniquely identifies the class or predicate in the scope of the ontology. It should be based on the name of the class or predicate and should be self-explanatory. A class name should be written in the CamelCase notation. A predicate name should be written in the camelCase notation (i.e. it should start with a lower case letter).
For example, the Czech Trade Inspection Authority needs to create an ontology with classes and predicates specific to its domain of inspection results. The URL of the ontology could be
http://data.coi.cz/ontology/InspectionResults
The URL of the class which represents confiscated goods can be
http://data.coi.cz/ontology/InspectionResults/Confiscation
The URL of the ontology should be dereferenceable. When the URL is resolved, class and predicate definitions should be returned in an appropriate RDF notation, e.g., TTL.
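A minimal sketch of such an ontology, created with the Python rdflib library, might look as follows. The class and predicate are only examples; a real ontology would reuse existing ontologies wherever possible, as recommended above:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

ONTOLOGY = URIRef("http://data.coi.cz/ontology/InspectionResults")
O = Namespace("http://data.coi.cz/ontology/InspectionResults/")

g = Graph()
g.bind("o", O)

# The ontology itself is an entity with its own URL.
g.add((ONTOLOGY, RDF.type, OWL.Ontology))

# A new class for a concept not covered by the reused ontologies.
g.add((O.Confiscation, RDF.type, RDFS.Class))
g.add((O.Confiscation, RDFS.label, Literal("Confiscation of goods", lang="en")))
g.add((O.Confiscation, RDFS.isDefinedBy, ONTOLOGY))

# A new predicate (camelCase) relating a confiscation to the inspection during which it happened.
g.add((O.relatedInspection, RDF.type, RDF.Property))
g.add((O.relatedInspection, RDFS.label, Literal("related inspection", lang="en")))
g.add((O.relatedInspection, RDFS.domain, O.Confiscation))
g.add((O.relatedInspection, RDFS.isDefinedBy, ONTOLOGY))

g.serialize(destination="InspectionResults.ttl", format="turtle")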
P02A03-03-02 Designing codelists and taxonomies
Besides ontologies, it is also necessary to correctly design the codelists of terms and taxonomies of terms which are used as prescribed values of properties defined by an ontology. Since a codelist is a special case of a taxonomy (a codelist is a taxonomy without semantic relationships between the terms), we will speak only about taxonomies. However, all recommendations are applicable to codelists as well.
We recommend proceeding in the following steps:
- Identify predicates in the designed ontology whose values should be from a taxonomy (e.g., a taxonomy of geopolitical regions).
- First, it is recommended to search for existing taxonomies published as 5* data which can be reused in your 5* data.
○ The most frequently reused taxonomies are
■ Multilingual thesaurus of European Union (EUROVOC) [http://open-data.europa.eu/cs/data/dataset/eurovoc]
■ Nomenclature of Economic Activities (NACE) [http://ec.europa.eu/eurostat/ramon/ontologies/nace.rdf]
■ Nomenclature of territorial units for statistics (NUTS) [http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/]
■ ISO 639-1 - language codes [http://www.loc.gov/standards/iso639-2/php/code_list.php]
■ ISO 4217 - currency codes
[http://www.iso.org/iso/home/standards/currency_codes.htm]
■ ISO 3166-1 - country codes
[http://www.iso.org/iso/country_codes.htm]
■ UN/CEFACT units of measurement
[http://www.unece.org/fileadmin/DAM/cefact/recommendations/rec20/rec20_rev3_Annex3e.pdf]
○ It is recommended to search for other taxonomies in the data portals of international organizations, e.g., the World Health Organization (http://www.who.int/research/en/), the World Bank (http://data.worldbank.org/) or the United Nations (http://data.un.org/). It is also possible to reuse the services of the European Union (http://publications.europa.eu/mdr/authority/index.html, http://ec.europa.eu/eurostat/ramon)
- If no appropriate taxonomy can be found, or only a taxonomy which covers our needs partially, we create a new taxonomy.
- To express a new taxonomy, the SKOS (Simple Knowledge Organization System) ontology must be used.
A new taxonomy should have a URL in the form
{catalogue-URL}/taxonomy/{taxonomy-name}
where {catalogue-URL} is the catalogue URL (see P02A02-02) and {taxonomy-name} should be based on the name of the dataset or domain for which it has been constructed and should be self-explanatory. It must uniquely identify the taxonomy in the set of all taxonomies created by the organization, and it should be written in the CamelCase notation. We will call this URL the taxonomy URL and denote it {taxonomy-URL}.
The URL of a term of the taxonomy should be created according to the pattern
{taxonomy-URL}/{term}
where {term} is a string which uniquely identifies the term in the scope of the taxonomy. It should be based on the term itself and should be self-explanatory.
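A minimal sketch of a new taxonomy expressed with SKOS, again using the Python rdflib library, might look as follows; the taxonomy name, terms and hierarchy are purely illustrative:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

SCHEME = URIRef("http://data.coi.cz/taxonomy/InspectionTypes")
T = Namespace("http://data.coi.cz/taxonomy/InspectionTypes/")

g = Graph()
g.bind("skos", SKOS)

g.add((SCHEME, RDF.type, SKOS.ConceptScheme))
g.add((SCHEME, SKOS.prefLabel, Literal("Types of inspections", lang="en")))

# Terms of the taxonomy are SKOS concepts; skos:broader expresses the hierarchy.
for term, label, broader in [
    ("inspection", "Inspection", None),
    ("follow-up-inspection", "Follow-up inspection", "inspection"),
]:
    concept = T[term]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))
    g.add((concept, SKOS.inScheme, SCHEME))
    if broader:
        g.add((concept, SKOS.broader, T[broader]))

g.serialize(destination="InspectionTypes.ttl", format="turtle")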
P02A03-03-03 Recommended basic structure of each entity in a dataset published as 5* data
To increase the reusability of our datasets by different applications, it is recommended to reuse existing ontologies as explained above. However, even those ontologies often overlap in predicates which represent basic attributes like the title or description of an entity. Therefore, it is recommended to use the following properties for basic attributes:
Entity attribute | Predicate | Ontology |
title | dcterms:title | DCMI Metadata Terms, if Schema.org is not used for the dataset |
title | s:name | Schema.org, if it is used also for other attributes or classes in the dataset |
description | dcterms:description | DCMI Metadata Terms, if Schema.org is not used for the dataset |
description | s:description | Schema.org, if it is used also for other attributes or classes in the dataset |
preferred title (if there are more titles of the entity) | skos:prefLabel | SKOS |
alternative title (if there are more titles of the entity) | skos:altLabel | SKOS |
creation date | dcterms:created | DCMI Metadata Terms |
last modification date | dcterms:modified | DCMI Metadata Terms |
address | s:address | Schema.org |
GPS coordinates | s:geo | Schema.org |
P02A03-04 Determining entity identifiers
Datasets represent real-world entities in the form of rows in tables, elements in XML documents or objects in JSON documents. It is important to know what the identifier of each entity is. For example, in a table we need to know which column or columns form the identifiers of the entities represented in the table.
Identifiers are important for application developers. They use them to identify entities in their source code and to fuse information about the entities from different data sources. Therefore, it is a good practice to follow some basic rules summarized below:
- Each entity should have an identifier which should be composed of only one attribute of the entity.
- The entity identifier should not be an artificially generated meaningless value stored as a primary key in the database. It should be a value which is used for sharing information about the entity in the real world and by real systems.
○ For example, business organizations in a given country can be identified by an organization identification number. This number is used by different public authorities and their information systems to identify the entity.
- Only when no real-world identifier exists for the entity is it possible to use the organization's own identifier. It can also be an artificially generated meaningless value.
- Each identifier must be described in the data schema of the dataset. The way of description depends on the constructs of a particular schema language.
P02A03-05 Convention for URLs of entities in datasets published as 4* and 5* data
Recommendations in this section are relevant for datasets published as 4* and 5* data. For these datasets it is moreover necessary to design URLs for their entities. To construct the URLs we use the identifiers determined according to P02A03-04. Each entity in the dataset must have a URL which meets the following rules:
1. It is used in the machine readable representation of the entity as the entity identifier.
2. It does not change during the whole lifecycle of the entity.
3. It is used to link the entity from other related entities in the same or other datasets (of the same or other organizations). This is similar to the URL of a web page, which is also used by other web pages to link to that page.
4. In the case of 5* data, the URL is dereferenceable, i.e. when a client requests the URL via the HTTP protocol, it receives a machine readable representation of the entity in the RDF model serialized in an appropriate format (e.g., TTL or JSON-LD) from the server.
The organization can design any convention for the URLs of entities as long as rules 1-3 are ensured. It is necessary to ensure rule 4 for entities published as 5* data; when rule 4 is not ensured, the dataset is not published as 5* data.
If the organization does not want to design its own convention for URLs, we recommend the following pattern:
{base-URL}/{type}/{identifier}
where
- {base-URL} is the entity URL base (see P02A02-03)
- {type} is a string which is based on the name of the class of the entity. If the entity belongs to more than one class, it is recommended to choose the most specific one. It is also recommended to write {type} in lower case with individual words separated by the ‘-’ character. E.g., if the class has the name Check Action, then {type} is check-action.
- {identifier} is the value of the identifier of the entity.
For example, the Czech Trade Inspection Authority publishes its inspections, which are considered individual entities. Inspections are instances of the class http://schema.org/CheckAction. The name of the class is Check action. Inspections do not have any real-world identifier which would be shared by different organizations. The Authority therefore generates its own artificial identifiers. An example of the URL of an inspection made by the Authority is
http://data.coi.cz/resource/check-action/101201020952801
Another example is the organizations inspected by the Czech Trade Inspection Authority. The Authority represents organizations as instances of the class http://purl.org/goodrelations/v1#BusinessEntity. In the Czech Republic, public authorities use official identification numbers to identify organizations. The numbers are maintained by the Czech Statistical Office. Therefore, the Czech Trade Inspection Authority uses those numbers as identifiers in its URLs of inspected organizations. An example of such a URL is
http://data.coi.cz/resource/business-entity/24693782
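A small Python helper following this convention might look as follows; the base URL is the example entity URL base from P02A02-03, and the class names are taken from the examples above:

import re

BASE_URL = "http://data.coi.cz/resource"  # entity URL base, see P02A02-03

def entity_url(class_name: str, identifier: str) -> str:
    """Build an entity URL following the pattern {base-URL}/{type}/{identifier}."""
    # Class name written in lower case with words separated by the '-' character.
    entity_type = re.sub(r"[\s_]+", "-", class_name.strip()).lower()
    return f"{BASE_URL}/{entity_type}/{identifier}"

print(entity_url("Check Action", "101201020952801"))
# -> http://data.coi.cz/resource/check-action/101201020952801
print(entity_url("Business Entity", "24693782"))
# -> http://data.coi.cz/resource/business-entity/24693782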
It is not necessary to create URLs for all entities. Sometimes it is sufficient to reuse a URL which is assigned to the entity by another organization in its own dataset. This is typical in case of various codelists and taxonomies published by various standardization organizations and reused in many datasets.
The decision whether the organization should design its own URL for an entity or reuse an existing URL is not straightforward. It is not possible to state that the organization should create a URL for an entity only when no other URL for the entity has been created by another organization. It is common in the Linked Open Data space that a real-world entity (e.g., a business entity) has several different URLs assigned by different organizations. E.g., a business register in the Czech Republic can create one URL for an organization and the Czech Trade Inspection Authority can create another one. These URLs can be explicitly linked (see P02A03-06 for more details) so that it is stated that they are both different URLs of the same real-world entity. The following recommendations can help with the decision making:
- The organization has to design its own URL when it is necessary to ensure that the URL of an entity is dereferenceable, so that a client can receive data about the entity published by the organization.
- The organization has to design its own URL when it is not necessary to ensure that the URL of an entity is dereferenceable, but the organization wants to offer a URL so that other datasets can refer to the entity.
In any case, if the organization creates its own URL for an entity and it is known that another organization has created another URL for the same entity, then this other URL should be linked as specified in P02A03-06.
P02A03-06 Designing links to other datasets published as 5* data
The last step of designing the form of publication of a 5* dataset is to design links to other datasets. These links contribute to the global network of Linked Open Data.
The organization creates basic links by reusing URLs from common codelists and taxonomies in a dataset, as discussed in Designing codelists and taxonomies. By composing these links with the links to the same codelists and taxonomies from other datasets, we can derive links between the dataset and those other datasets. The organization also creates links from its dataset to other datasets when it reuses URLs of entities from the other datasets instead of creating its own URLs for the entities (see P02A03-05).
Other links can be designed on the basis of the following recommendations.
First, it is necessary to identify datasets which can potentially be linked from the dataset of the organization. We will call those datasets external datasets. This can be done using the following recommendations:
- Analyse which entities in the dataset are also maintained by various national or international registries which publish their datasets as 4* or 5* data and contain information about the same entities. These datasets are candidates for linking with the dataset.
○ Typically, these registries are registries of public authorities and statistical offices, geographical registries, registries of legislation, etc.
- Analyse which organizations of the same or similar kind as your organization publish external datasets of the same or similar kind as your dataset. Analyse whether there are some overlaps in the published entities. If there are some overlaps, the respective external datasets are other candidates for linking.
○ For example, Czech Trade Inspection Authority can find some overlaps with datasets published by other inspection authorities in Czech Republic or in other European countries.
- Usually, it is possible to link some entities in the dataset to entities published by various encyclopaedic data sources as 4* or 5* data. Such data sources are, e.g.:
○ DBPedia [http://dbpedia.org] - an image of Wikipedia in the form of 5* data. An entity has a DBPedia URL in the form http://dbpedia.org/resource/{entity-name}, where {entity-name} is based on the entity name in Wikipedia such that http://en.wikipedia.org/wiki/{entity-name} is the Wikipedia page about the entity. E.g., the URL of the web page about the drug Ibuprofen on Wikipedia is http://en.wikipedia.org/wiki/Ibuprofen. The DBPedia URL of Ibuprofen is therefore http://dbpedia.org/resource/Ibuprofen.
○ Freebase [http://freebase.org] - an open database of people, places and things operated by Google, Inc.
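As a small illustration of the DBPedia URL pattern mentioned above, the following Python sketch derives a DBPedia entity URL from a Wikipedia page title; it is a convenience illustration only.

```python
# Illustrative sketch: derive a DBPedia entity URL from a Wikipedia page title,
# following the pattern http://dbpedia.org/resource/{entity-name}.
from urllib.parse import quote

def dbpedia_url(wikipedia_title: str) -> str:
    # Wikipedia titles use underscores instead of spaces; other characters
    # are percent-encoded.
    entity_name = quote(wikipedia_title.replace(" ", "_"), safe="_()'")
    return "http://dbpedia.org/resource/" + entity_name

print(dbpedia_url("Ibuprofen"))
# http://dbpedia.org/resource/Ibuprofen
```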
For each identified external dataset with a linking potential, it is necessary to design the links. It is recommended to proceed in the following steps.
- If the same entities appear both in the dataset of the organization and in the external dataset, but with different URLs, the organization should link the URLs of the entities in its dataset with the URLs of the entities in the external dataset using the predicate owl:sameAs. This predicate is defined by the OWL ontology. It states that two different URLs identify the same real-world entity. A client receives different data about the entity by resolving each of those URLs (because each resolves to data from a different dataset). Based on the owl:sameAs statement, a client resolving one of the URLs can deduce that there is another URL where more information about the entity can be obtained, provided that the statement is present in the data obtained from the first URL.
○ The links should be published by the organization as a separate dataset. A dataset which contains only links is called a linkset. VoID offers specific constructs to express metadata of linksets.
- If there are entities in the external dataset which are somehow related to or enrich some entities in the dataset of the organization, the organization should link the URLs of the entities in its dataset with the URLs of the entities in the external dataset using an appropriate predicate.
○ The linking predicate should be chosen from some existing commonly used ontology (see Designing vocabularies and ontologies).
○ It is also possible that the documentation of the external dataset or its ontology specifies which entities should be linked with what predicates. In that case use those predicates.
○ If no predicate can be found, it is necessary to create a new predicate as described in P02A03-03.
○ The links should be published by the organization as a separate linkset.
Let us show some sample links for entities from datasets published by the Czech Trade Inspection Authority. According to the recommendations in P02A03-03, namely the part Designing codelists and taxonomies, the Authority identified the taxonomy of geopolitical regions of the European Union, the so-called NUTS codes. Inspections are linked to this taxonomy. A sample URL of the NUTS region CZ010 is
http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/CZ010
A dataset with inspections therefore uses an appropriate predicate to link inspections to those URLs of NUTS regions.
It is also possible to design other links. There is an official business register maintained by the Ministry of Justice of the Czech Republic. If the register were published as a 4* or 5* dataset, there would be a URL for each business organization in that dataset. A sample URL is
http://data.justice.cz/resource/business-entity/12345678
The Czech Trade Inspection Authority assigns its own URLs to the inspected organizations. The same organization would be identified in the dataset of the Authority with the following URL:
http://data.coi.cz/resource/business-entity/12345678
The Authority should publish a linkset which links the URLs of organizations created by the Authority to the URLs of those organizations in the business register. The owl:sameAs predicate should be used. A client who knows the URL of a business organization created by the Authority can now easily and automatically gather data about the business organization from other linked datasets.
Last but not least, we recommend describing the patterns of entity URLs in the VoID description of the dataset. This makes it much easier for publishers of other datasets to decide about links to your dataset (see the sketch below).
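A minimal sketch of such a linkset and VoID description, using the Python rdflib library and the illustrative URLs from the example above; the linkset and dataset URIs are hypothetical placeholders, not real resources published by the Authority.

```python
# Illustrative linkset: owl:sameAs links between the Authority's entity URLs and
# the business register URLs, plus VoID metadata including the entity URL pattern.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

linkset = Graph()
linkset.bind("owl", OWL)
linkset.bind("void", VOID)

coi_entity = URIRef("http://data.coi.cz/resource/business-entity/12345678")
register_entity = URIRef("http://data.justice.cz/resource/business-entity/12345678")

# owl:sameAs states that both URLs identify the same real-world business entity.
linkset.add((coi_entity, OWL.sameAs, register_entity))

# VoID metadata of the linkset and the recommended description of the URL pattern
# of entities in the Authority's dataset (both URIs below are hypothetical).
linkset_uri = URIRef("http://data.coi.cz/linkset/business-entities")
dataset_uri = URIRef("http://data.coi.cz/dataset/inspections")
linkset.add((linkset_uri, RDF.type, VOID.Linkset))
linkset.add((linkset_uri, VOID.linkPredicate, OWL.sameAs))
linkset.add((dataset_uri, RDF.type, VOID.Dataset))
linkset.add((dataset_uri, VOID.uriSpace,
             Literal("http://data.coi.cz/resource/business-entity/")))

linkset.serialize(destination="coi-business-register-linkset.ttl", format="turtle")
```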
P02A04 Selection and implementation of the software tools
For dataset publication, it is recommended to choose several software tools which implement the components of the reference architecture depicted in Figure 1. These software tools will facilitate and automate the process of dataset publication.
Figure 1: Reference architecture of software tools for open data publication
The components of the reference architecture are the following:
- Internal data catalogue is a component for recording all datasets identified in phase P01 and for filling in their catalogue records during phase P02. It is not a public data catalogue. It contains records about datasets which will be published sometime in the future and also about those where the organization is not yet sure whether they will be published. It can be implemented in the form of a spreadsheet file and edited with a tabular editor. However, we recommend choosing a tool for dataset cataloguing which supports DCAT and other recommendations from P02A02.
- ETL (Extract-Transform-Load) tool is a software tool which enables the configuration of so-called ETL procedures. An ETL procedure prepares a dataset listed in the internal data catalogue for publication. Part E (Extract) collects data from the data sources prepared according to the recommendations in P02A01. Part T (Transform) implements a set of transformations, incl. data format conversions, data cleansing, anonymization, linking and enrichment. Part L (Load) enriches the prepared dataset with metadata (catalogue record) and loads the dataset to the internal data store.
○ It is possible to choose any ETL tool. We recommend a tool which enables scheduling of periodic runs of the designed ETL procedures.
○ However, it is also possible to realize the ETL procedures manually, without a software tool. This can be reasonable in case of datasets which are published only once or with a very long periodicity.
- Internal data store is a space where prepared datasets are stored. It can be a local file system of the organization, an appropriate database server or a combination of both. It depends on the chosen data formats and the way of making the datasets available for the end users.
- Open data interface makes the published datasets available to end users. Data catalogue connectors export catalogue records to data catalogues chosen as recommended in P02A02. One of them is the public data catalogue of the organization if the organization has one (see P02A02-02). Export to data files takes the datasets from the internal data store and offers them to the end users for bulk downloads. REST API / SPARQL Endpoint are services which provide application developers with direct programmatic access to the datasets.
- Data catalogue of the organization is a web page or a set of web pages structured according to P02A02-02.
When publishing open data all applicable security policies and standards must be followed.
The COMSODE project is developing an implementation of the reference architecture. It is called Open Data Node (ODN). Recommendations, practices and procedures of correct implementation of ODN with respect to this Methodology will be published by the COMSODE project as a deliverable D5.2 “Methodologies for deployment and usage of the COMSODE publication platform (ODN), tools and data”.
P02A05 Definition of the approach to the dataset publication
The goal of this task is to define the overall approach to the dataset publication. Activities and steps during which the datasets are prepared and subsequently published should be described. It should also be determined which of these activities and steps will be automated using ETL procedures (see the task P02A06 Design and implementation of the ETL procedures) and which will be performed manually.
Concrete steps required to prepare and publish a dataset might depend on its nature. However, in general it should be analysed how the following activities and steps will be performed:
- transformation and cleansing of datasets,
- anonymization of datasets,
- updates of datasets.
In the case of Linked Data, the data integration approach and linking to other datasets should be defined as well.
P02A05-01 Approaches to data cleansing
Data cleansing is a complex problem which includes many activities. Because of its importance, this Methodology provides a separate cross-cutting activity CA01 focusing on this domain.
P02A05-02 Approaches to data anonymization
Data anonymization is another complex problem. This Methodology comprehends the anonymization problem from the point of view of risk management - a risk that protected data will be published. Therefore, it is explicitly described in the cross-cutting activity CA03. If a risk of publishing protected data is identified, it is necessary to choose an appropriate anonymization technique:
- Projection - particular attributes with protected data are removed from the dataset. E.g., in case of tabular files, it means removing a column or columns.
- Aggregation - merging several items to one statistical item (e.g. merging persons and their ages in a region and publishing only the average age of the persons in each region).
- Removing links - when links from a 5* dataset to other datasets were created according to P02A03-06, it is necessary to analyse whether they reveal any protected data. If so, those links must be removed before publishing the dataset.
For each anonymization technique it is necessary to analyse whether there is a way for deanonymization of the anonymized data. This is specific for each dataset and it is a complex problem. Concrete solutions to this problem are out of the scope of this Methodology.
P02A05-03 Approaches to data integration
Data integration is the last complex problem mentioned explicitly by this Methodology. By data integration we mean either enrichment of a dataset with information from another dataset, so-called fusion, or linking two datasets without changing their content. There are various techniques for data integration. Their full implementation, however, would be very complex for the organization and it would make the process of publication very costly and time consuming. Therefore, we recommend fulfilling only the recommendations for identifiers of entities and their linking as described in P02A03-06. Advanced integration techniques can be applied later by the organization. It is also possible that better integration will be done and published by independent developers who reuse the open datasets in their software applications.
P02A05-04 Approaches to updates of datasets
It should be determined whether the updated data will be published as a separate data source or as a single data source which contains a complete dump of the data. In the first case, a new data file is added with every update, for example national foreign trade statistics provided in separate files for the years 2012 and 2013. The latter case might be represented by a situation in which the data source is overwritten with a new release of data every time it is updated. For example, the Czech Trade Inspection Authority publishes its data about inspections, sanctions and bans quarterly. A complete dump is generated every three months and the previously released data is replaced with this new dump[19]. Preferences of the users might be taken into consideration when deciding about the optimal approach.
P02A06 Design and implementation of the ETL procedures
An ETL procedure automates the process of publication of a dataset designed according to the recommendations in P02A05. It is not necessary to automate publication of each dataset. If a dataset is published only once or with a very long periodicity, it may be more effective to publish the dataset manually. In other cases, it is recommended to automate the publication.
Designing an ETL procedure means describing from where and how data should be extracted (E), how the data will be transformed from the source to the target form (T) and how the resulting dataset will be published (L). Depending on the way of publishing updates of the dataset it may be necessary to design one or more of the following ETL procedures:
- An ETL procedure for preparation and publication of the complete and current content of the dataset. The result is a set of all items which exist in the dataset at the time of publication. The result is independent of the previous version of the dataset.
- An ETL procedure for preparation and publication of items of the dataset which have been changed since the last publication of the dataset. The result is a set of items which have been changed. This includes newly created items, updated items and information about which items have been removed from the dataset.
- An ETL procedure for preparation and publication of the sequence of changes made in the dataset since the last publication of the dataset. The result is not a set of items but a sequence of changes.
The first ETL procedure must be designed in any case. It is necessary for the first (initial) publication of the dataset, and it is also used whenever a complete dump of the dataset is published in a given period. The other ETL procedures are necessary if only the changes to the dataset are published in each period (the sketch below illustrates the difference between a complete dump and a set of changed items).
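The following hedged Python sketch shows the difference between the outputs of these procedures: a complete dump versus the set of items created, updated or removed since the last publication. The file names and the identifier column are illustrative assumptions.

```python
# Illustrative sketch: derive the set of changed items from two complete dumps.
import csv

def load_items(path, key="id"):
    """Read a CSV dump into a dictionary keyed by the item identifier."""
    with open(path, newline="", encoding="utf-8") as source:
        return {row[key]: row for row in csv.DictReader(source)}

previous = load_items("inspections-previous.csv")  # last published dump
current = load_items("inspections-current.csv")    # current complete dump

created = [current[i] for i in current.keys() - previous.keys()]
removed = sorted(previous.keys() - current.keys())
updated = [current[i] for i in current.keys() & previous.keys()
           if current[i] != previous[i]]

# The "changes only" publication then consists of the created and updated items
# plus the identifiers of the removed items.
print(len(created), "created,", len(updated), "updated,", len(removed), "removed")
```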
An ETL procedure is described as a sequence of steps. It is recommended to describe the steps at a high level of abstraction so that it is possible to parameterize them for the particular dataset. This will ensure that the steps are components which can be reused for different datasets and purposes.
P02A06-01 Designing extractors
In P01A05, the organization identified its available sources of data. In P02A01, it was specified how to access those sources from external programs, e.g., ETL procedures. Designing extractors means designing components of ETL procedures which access the sources of data and extract required data from them. The extractors do not perform any transformations of the data. Typically, it is necessary that the chosen ETL tool supports the following components which can be used as extractors in ETL pipelines:
- A component which downloads a data file from a given URL.
- A component which copies a data file from a local file system.
- A component which accesses a relational database with SQL queries (SELECT).
- A component which accesses an RDF database with SPARQL queries (SELECT, CONSTRUCT).
For each dataset, it is necessary to identify its extractors and configure them (e.g., to provide a path to a source data file in a local file system or an SQL SELECT query to extract required data from a relational database).
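A minimal Python sketch of two such extractor components, assuming a hypothetical source URL, file names and an SQLite database; a real ETL tool would provide equivalent configurable components.

```python
# Illustrative extractors: download a source data file and run a SELECT query.
import sqlite3
import urllib.request

def extract_file(url, target_path):
    """Download a source data file without transforming it."""
    urllib.request.urlretrieve(url, target_path)

def extract_sql(database_path, query):
    """Run a read-only SELECT query and return the raw rows."""
    with sqlite3.connect(database_path) as connection:
        return connection.execute(query).fetchall()

# Hypothetical configuration of the extractors for one dataset.
extract_file("http://example.org/source/inspections.csv", "inspections.csv")
rows = extract_sql("agenda.db", "SELECT id, date, region FROM inspections")
```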
P02A06-02 Designing transformers
In P01A03, the organization chose the target data formats for publishing datasets. In P02A03, data schemas for the datasets were designed. Designing transformers means designing components of ETL procedures which prepare the dataset in the designed form. In the case of 5* data it also means designing components which create links to other datasets identified according to the recommendations in P02A03-06. Transformers change not only the data formats and structure of datasets but also their content, e.g., by cleansing and anonymizing the dataset. Typically, it is necessary that the chosen ETL tool supports the following components which can be used as transformers in ETL pipelines (a minimal transformer is sketched after the list):
- Components for transforming structure and data format conversion
○ A component for transforming proprietary tabular formats (XLS(X), ODS, DBF, etc.) and results of SQL queries (SELECT) to the CSV format.
○ A component for transforming XML files to other XML files on the basis of XSLT scripts.
○ A component for transforming JSON files to other JSON files.
○ A component for transforming JSON files to XML files and vice versa.
○ A component for transforming CSV, XML and JSON formats to RDF representation. In case of XML, it can be based on XSLT scripts.
○ A component for transforming RDF representation using the SPARQL language.
- Components for transforming content of a dataset
○ Components for data cleansing
○ Components for data anonymization
- Components for data integration
○ Components for linking datasets with other datasets (using the SPARQL language)
○ Components for enriching datasets with content of other datasets on the basis of the created links
- A component for automated and manual filling of metadata of datasets according to the schema designed as recommended in P02A02-01.
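A minimal Python sketch of one transformer from the list above: conversion of CSV rows into an RDF representation using the rdflib library. The input columns, the vocabulary terms and the URL pattern are assumptions for illustration only.

```python
# Illustrative transformer: convert CSV rows into RDF triples.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")                        # assumed vocabulary
EX = Namespace("http://data.example.org/resource/inspection/")  # assumed URL pattern

graph = Graph()
graph.bind("schema", SCHEMA)

with open("inspections.csv", newline="", encoding="utf-8") as source:
    for row in csv.DictReader(source):
        inspection = EX[row["id"]]
        graph.add((inspection, RDF.type, SCHEMA.Event))
        graph.add((inspection, SCHEMA.startDate, Literal(row["date"])))
        graph.add((inspection, SCHEMA.location, Literal(row["region"])))

graph.serialize(destination="inspections.ttl", format="turtle")
```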
P02A06-03 Designing loaders
Loaders are the last part of an ETL procedure. They are components which ensure that the dataset is exported to data files or loaded to a database server. Concrete recommendations for a dataset are the following (an illustrative loader is sketched after the list):
- If the dataset will be available to users only for bulk downloading, the ETL procedure should load the data files to a location which can be accessed by users via the HTTP or FTP protocols. It is also possible to load the files to a Git server, e.g., the public Github.com.
- If the dataset will be available via an API, the ETL procedure should load the data to a database server.
○ For 3* and 4* data, the API should be a REST service which is able to provide the programmatic access to the items of the dataset and return the representation of the items in CSV, XML or JSON formats. The data should be stored in a relational database or in a NoSQL database.
○ For 5* data, the API should be a SPARQL endpoint. The data should be stored in an RDF database or in a relational database with a layer which enables to view the relational data as RDF data and evaluate SPARQL queries.
- For 5* data, it is also necessary to ensure that URLs of entities in the dataset are dereferenceable. That means that if a client resolves an entity URL, it receives a machine-readable RDF representation of the entity.
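An illustrative Python sketch of two loaders: copying the prepared file to a directory served over HTTP for bulk download, and POSTing RDF data to a SPARQL 1.1 Graph Store endpoint. The paths and the endpoint URL are hypothetical, and the upload mechanism of a particular triple store may differ.

```python
# Illustrative loaders for bulk download files and for an RDF database.
import shutil
import urllib.request

def load_to_download_area(prepared_file, web_root):
    """Place the prepared data file where users can fetch it via HTTP or FTP."""
    shutil.copy(prepared_file, web_root)

def load_to_graph_store(rdf_file, graph_store_url):
    """POST Turtle data to a SPARQL 1.1 Graph Store endpoint (assumed to exist)."""
    with open(rdf_file, "rb") as source:
        request = urllib.request.Request(
            graph_store_url,
            data=source.read(),
            headers={"Content-Type": "text/turtle"},
            method="POST",
        )
        urllib.request.urlopen(request)

load_to_download_area("inspections.ttl", "/var/www/opendata/")
load_to_graph_store("inspections.ttl",
                    "http://localhost:3030/dataset/data?default")
```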
P02A07 Testing of the ETL procedures
Together with the design of the ETL procedures it is also possible to create test data and test scenarios for testing them. Each implemented ETL procedure must be tested with the test data according to the testing scenarios. Testing will also be performed later during the routine operation of the ETL procedures.
- Testing data should be prepared by choosing concrete items of the dataset in its source form.
- For the testing data, the target form which should be produced by the ETL procedure should be prepared manually.
- The first testing scenario should be a test that the ETL procedure executed with the testing data in their source form produces the result equal to the manually prepared target form.
- The second testing scenario should be validation of the produced data against the designed data schema. In case of 5* data, strict validation is not possible. We propose to create a SPARQL query which checks that the mandatory predicates are present in the result (a sample query is sketched after this list).
- Other testing scenarios should be various queries on the result of the ETL procedure which check, e.g., presence of various values for particular attributes/predicates or links to other datasets (in case of 5* data). Concrete queries depend on the tested ETL procedure.
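A sketch of the second testing scenario for 5* data: a SPARQL query (executed here with rdflib) that checks whether any entity lacks a mandatory predicate. The vocabulary and the file name follow the illustrative transformer above, not the Methodology itself.

```python
# Illustrative check of mandatory predicates in the produced RDF data.
from rdflib import Graph

graph = Graph()
graph.parse("inspections.ttl", format="turtle")

# ASK whether there is at least one inspection without the mandatory startDate;
# the test passes when the answer is False.
mandatory_check = """
PREFIX schema: <http://schema.org/>
ASK {
  ?inspection a schema:Event .
  FILTER NOT EXISTS { ?inspection schema:startDate ?date . }
}
"""
violation_found = graph.query(mandatory_check).askAnswer
print("Mandatory predicate missing:", violation_found)
```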
P02A08 Licensing
The last part of the preparation for publishing datasets is the decision about licences or terms of use. It is preferred to use standardized licences or terms of use. If necessary, it is possible to propose your own licence or terms of use.
Prior to making a decision on whether and/or what terms of use shall be used for dataset licensing, it is necessary to answer the following questions:
- Is the said dataset (or are the dataset items) protected by the sui generis database right (“SGDR”)/copyright, i.e. can it be considered a work/a database as defined by the relevant legislation?
- Is the public sector body (“PSB”) entitled to grant licences to the said dataset, i.e. are there any limitations for the PSB in terms of the dataset licensing?
- Which particular licence shall be opted for the said dataset?
Based on the answers to the above questions it is then possible to make a conclusion on the im/possibility of licensing the said dataset.
Note - in order to clarify some issues and/or illustrate particular problems, this part of the methodology uses terms and provisions contained in the Czech Copyright Act (“CCA”).
A. The CCA sets out the following possible ways of legal protection of a dataset/dataset items:
- dataset is a database in which the selection or the arrangement of the contents of the database is the author's own intellectual creation - copyright protection of the database structure;
- dataset contains items which are considered to be works in terms of CCA - copyright protection of the database items;
- dataset is a database which is arranged in a systematic or methodical manner and individually accessible by electronic or other means, irrespective of the form of the expression thereof - SGDR protection.
There can be a case when a PSB holds a database and, e.g., the database items are not copyright protected because they are not works in terms of the CCA; or they can be works in terms of the CCA but are not copyright protected by virtue of law; or the dataset as such, although being a database as defined in the CCA, is not SGDR protected due to exceptions set by the CCA:
- I. there is no work - for example raw data which does not fulfil definition of work under CCA;
- II. there is/can be a work - for example legal act as an official work; there is a public interest in not granting copyright protection to these types of works;
- III. dataset is a database in terms of CCA but is not granted SGDR protection due to legal exceptions - for example a dataset as part of a legal act.
In the above cases (I., II., III.) licensing of the dataset/dataset items should be refrained from, as there is no copyright/SGDR protection (this recommendation is not applicable in all cases - for example, in case III. it is possible that the dataset structure is granted copyright protection and thus it still might be necessary to license the dataset).
Another case when licensing should be refrained from is a work in the public domain, for example a dataset item/dataset that used to be a work/database but the period of duration of the economic rights/SGDR has already expired. Although the said dataset item/dataset shall not be licensed anymore, it is still necessary to take into consideration moral interests, for example an obligation to attribute.
Dataset content can consist of a non-official work or any work other than an official work that is held by the PSB; a dataset can also be part of an official work other than a legal act:
- non-official work - a work not included in the list of official works in CCA (e.g. statistics of a PSB decision making process); as a work it is granted copyright protection;
- dataset is part of an official work other than a legal act - in this case SGDR is granted to the dataset (provided other conditions set in the CCA are fulfilled).
In the above two cases the dataset/dataset items shall be provided to end-users by using licences.
B. A PSB may not be entitled to license a dataset, e.g. in the following cases:
- there is a restriction given by data protection legislation[20];
- there is an obligation of PSB to keep information confidential;
- there are third party rights (e.g. employees' rights from employment contracts; subcontractors' rights from contracts for work; other contractual parties' rights from any other contracts).
Also, a PSB might not exercise economic rights to some dataset items. Provided the PSB decides to license the dataset under CC:BY v4.0 (see part C.), which now also allows granting a licence to databases (in the ways described under part A.), the PSB should first obtain an approval from the relevant data items' right holders so that the PSB is entitled to license them. In case the PSB does not obtain the approval, this information should be clearly stated.
C. Below are a couple of recommendations for a PSB deciding to set the terms of dataset use via licences:
➔ PSB should opt for a standard licence - these do not result from individual negotiations; they are addressed towards indefinite recipients; they contain standard terms well known in advance to both license providers as well as license recipients; they are continuously altered by professionals in a field together with public participation; they intend to be globally usable to the extent possible. Currently the most well-known standard licenses which allow for dataset licensing are Creative Commons (“CC”)[21] licenses and Open Knowledge Foundation (“OKF”)[22] licenses.
➔ Using CC0[23] can be considered an appropriate tool from the end-user perspective - it means that the PSB waives its rights to the dataset (i.e. the database, the database structure, the database items) to the maximum extent allowed by national legislation. However, waiving of moral rights is generally a problem (including in the Czech Republic). It means that, e.g., in the Czech legal environment end-users cannot rely only on the CC0 text itself but they need to be aware of the national legislation as well (see above “to the maximum extent allowed by national legislation”). In some cases even using CC0 may, in line with national legislation, lead to the end-user obligation to attribute although this obligation is not clear from the CC0 text itself. Thus the situation may become similar to that of using CC:BY (see below in this part). From the PSB perspective CC0 is an appropriate tool as it is in line with the open data idea with minimum restrictions. Nevertheless, the PSB may under particular national legislation impose on end-users further obligations and restrictions which do not arise from CC0 itself.
➔ The CC:BY license, so-called Attribution, is another frequently recommended tool for licensing datasets[24] - it is one of the most well-known CC licenses and its use is in line with the open data definition. The fourth version of the CC licenses (CC v4.0) is applicable for dataset licensing also in cases where the license provider holds both copyright to the database structure/database items as well as SGDR. Thus CC:BY is newly an alternative to the ODC-BY license[25] (part of the licenses drafted by OKF). Compared to CC:BY, ODC-BY does not grant the end-user the right to use database items which can be copyright protected. That means ODC-BY covers only SGDR and copyright to the database structure, whereas CC:BY also allows for database items licensing (provided there is no restriction to do so on the side of the PSB - see part B.).
Using CC:BY for licensing a particular dataset leads especially to an obligation of end-users to attribute the source in case the database is publicly shared. Nonetheless, use of, e.g., raw data, ideas or facts (i.e. material not capable of copyright protection) mostly would not lead to an obligation to attribute the source, although the dataset itself is provided for re-use under CC:BY (there could be an exception in case of a copyright protected dataset structure, when raw data re-arrangement could lead to violation of copyright to the dataset structure). Use of CC:BY further leads to an obligation of the end-user to attribute the source in case the dataset is SGDR protected and the end-user shares the whole or a substantial part of the database content. This obligation would not be applicable if the dataset is shared in a place where SGDR is not incorporated within national legislation (however, there could still be an obligation to attribute provided there is copyright protection of the dataset structure), or if only a non-substantial part of the database content is being shared by the end-user, although in a place where SGDR is incorporated in national legislation (but again, the obligation to attribute could be kept provided the dataset is also protected by copyright to its structure).
➔ It is not recommended to use a license restricting the ways of dataset use, e.g. CC:BY:NC[26]. Use of CC:BY:NC means that end-users are not allowed to extract or use a substantial part of the database for commercial purposes, regardless of whether the dataset is shared publicly or not. There could also be other restrictions on the use of the dataset leading to a contradiction with the open data definition.
➔ It is not recommended to use a license restricting the creation of derivative works/databases, e.g. CC:BY:ND[27]. Use of this license means that the end-user is, e.g., not entitled to insert a substantial part of the licensed database into another publicly shared database to which he has SGDR. Therefore it leads to dataset re-use restrictions, which is not in line with the PSI Directive.
➔ It is not recommended to use a license requiring derivative results to keep the same license, e.g. CC:BY:SA[28]. End-users are obliged to license derivative results under the same license. This can restrict further re-use of the licensed dataset (see e.g. (Dulong De Rosnay and coll., 2014)).
➔ It is not recommended to create own licenses. If a PSB decides to draft its own license, it is highly recommended to keep the principles set out in CC:BY or, alternatively, CC0.
P03A01 Initial publication of the dataset
The prepared ETL procedures are executed and the first version of a dataset is published. The preparation and publication process proceeds according to the steps designed in P02A05 and P02A06. Testing and validation of the results follows as described in P02A07. After the dataset is published, it is also checked that the dataset is available as specified in P02A06-03.
P03A02 Data cataloguing
In task P02A02, the catalogue record schema is designed and the target data catalogues are chosen. Catalogue records for datasets are recorded in the internal data catalogue (P02A03-01). The goal of this task is to publish the catalogue records in the data catalogues selected in P02A02-04. Before publishing the catalogue records it is necessary to fill in the values which were not prepared in P02A03-01, i.e. values which are known only after the dataset is prepared and published. These are, for example, the date of publication, the date of the last modification of the dataset, or the links to the data files for downloading the dataset. These values can be filled in manually or automatically by ETL procedures before publishing the catalogue records (see the sketch below).
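A hedged Python sketch of filling in the remaining values of a DCAT catalogue record (publication date, last modification, download link) with rdflib; the dataset and distribution URIs and the file URL are illustrative placeholders.

```python
# Illustrative completion of a DCAT catalogue record before its publication.
from datetime import date
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, XSD

DCAT = Namespace("http://www.w3.org/ns/dcat#")

record = Graph()
record.bind("dcat", DCAT)
record.bind("dct", DCTERMS)

dataset = URIRef("http://data.example.org/dataset/inspections")
distribution = URIRef("http://data.example.org/dataset/inspections/csv")

record.add((dataset, DCTERMS.issued, Literal(date(2014, 7, 1), datatype=XSD.date)))
record.add((dataset, DCTERMS.modified, Literal(date.today(), datatype=XSD.date)))
record.add((dataset, DCAT.distribution, distribution))
record.add((distribution, DCAT.downloadURL,
            URIRef("http://data.example.org/files/inspections.csv")))

record.serialize(destination="catalogue-record.ttl", format="turtle")
```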
P03A03 Dataset maintenance
The goal of this task is to ensure continuous updates of datasets and catalogue records. It involves publishing new versions of a dataset or changes made to the dataset. It also involves corrections of the data in the dataset if errors were identified. For example, users of the dataset may report errors and the organization should apply appropriate corrections.
If the preparation and publication of the dataset is automated by an ETL procedure, it is also necessary to continuously monitor its status and health. It is also a good practice to automate the designed testing scenarios for the ETL procedure (see P02A07) and monitor the results of the tests. It is also necessary to monitor the status and health of the data source used by the ETL procedure to extract required data. In case of any problems a correction is necessary.
Dataset maintenance might also involve manual maintenance of datasets in case the preparation and publication of the dataset is not automated or if there are some manual operations, e.g. manual publication of updated data and updates of the respective catalogue record.
If a change to the structure of the dataset occurs it is necessary to analyse whether the data schema is changed. If so, we recommend to archive the dataset (see P04A01 and P04A02) and set up a new dataset.
User feedback might be a valuable input into the dataset maintenance. Users might be able to identify and report errors in the datasets (Both and Schieferdecker, 2012). Errors in datasets and metadata or other issues related to the data provision might trigger dataset maintenance activities, e.g. release of the corrected data file and update to the respective catalogue record. Therefore alongside the regular and scheduled dataset maintenance activities (usually following the dataset update periodicity), activities of this task might be performed ad hoc as a response to (but not limited to) the user feedback.
Because the user feedback is one of the inputs into this task and because this feedback might trigger the data quality improvement activities, activities of this task should be coordinated with the data quality management (CA01) activities as well as the activities of the communication management (CA02).
P04A01 Termination of the dataset maintenance
If the primary data of the dataset is no longer collected, or if the structure or meaning of the data changes so significantly that it is necessary to create a new dataset, maintenance of the dataset should be terminated. If it is possible and feasible, the previously published data should be kept available for those who reuse the dataset, i.e. the dataset should be archived. Respective catalogue records should be kept available as well. However, it is necessary to add information to the catalogue record that the dataset is no longer maintained. Users should be informed about the change in status of the dataset (see cross-cutting activity CA02 Communication management).
It is very important to inform the users that the maintenance of the dataset is going to be terminated. The communication strategy should govern how the users are informed about the change in the datasets status or availability. For more detailed guidelines please see the (CA02A09) Informing about the termination of maintenance or publication of a dataset task.
P04A02 Termination of the dataset publication
If it is no longer possible to publish some dataset, its publication should be terminated. A requirement to terminate publication of some dataset might result, for example, from changes in legislation or from a court ruling. As a consequence of these events, publication of the dataset might turn out to be violating some obligations or someone’s rights. In such a case it is necessary to block the public access to the dataset and its resources. However, the catalogue record should be kept available if possible and it should be updated with information that the particular dataset is no longer available. Users should be informed about the termination of publication of the dataset (see cross-cutting activity CA02 Communication management).
It is very important to inform the users that the publication of the dataset is going to be or was terminated. The communication strategy should govern how the users are informed about the change in the datasets status or availability. For more detailed guidelines please see the (CA02A09) Informing about the termination of maintenance or publication of a dataset task.
CA01A01 Data quality requirements analysis
The multi-dimensional nature of data quality makes it dependent on a number of factors that can be determined by analysing the user requirements. Data quality requirements specify the dimensions required to be tagged, or otherwise documented for the data, so that at query time users can retrieve data of specific quality (i.e., within some acceptable range of quality dimensions values). A data quality dimension is a data dimension that provides objective information about the data quality.
Over the years, many quality dimensions have been proposed, and many methodologies are also available in the field of data quality. An extended and deep analysis of existing methodologies and quality dimensions is available in (Batini, Cappiello, Francalanci & Maurino, 2009). According to Batini et al. (2009) the most common and frequently used quality dimensions are:
- Completeness is defined as the extent to which data are of sufficient breadth, depth, and scope for the task at hand.
- Consistency captures the violation of semantic rules defined over data elements.
- Accuracy is defined as the closeness between a value v and a value v’, considered as the correct representation of the real-life phenomenon that v aims to represent.
- Time related dimensions (namely timeliness, volatility, updateness) are defined as the extent to which data are updated.
There are different strategies for collecting quality dimensions. For example, a use case approach foresees defining a use case scenario or an application built on top of the dataset, since the evaluation of the dataset quality depends on the specification of the data consumer or application. Thus, the data quality requirements analysis phase focuses on the collection and discovery of requirements. A deep analysis of the requirements, through use cases or applications, specifies general suggestions on possible causes of errors and determines future targets to be achieved for data quality.
Other strategies for requirements analysis are proposed in (Jeusfeld, Quix & Jarke, 1998), (English, 1999), (Loshin, 2004) and (Batini and Scannapieco, 2006). Both (Jeusfeld, Quix & Jarke, 1998) and (English, 1999) were developed in the context of data warehouses and as a consequence can be useful mainly for statistical data.
(Loshin, 2004) is a cost-effective methodology that starts by identifying kinds of errors related to data. The methodology includes a very rich taxonomy of quality errors and related costs.
In (Batini and Scannapieco, 2006) the authors propose, among other things, to survey the opinions of data users and administrators in order to identify quality issues and new quality targets.
In any case, because data quality is mainly in the user's eyes, not in the data producer's, the selection of the appropriate data quality metrics is quite an ambitious task. For example, there are different metrics for measuring the accuracy of postal addresses, but the choice strongly depends on the application of such data. As a consequence, it is most important to consider the feedback provided by real users in order to understand the most appropriate quality metrics to apply.
CA01A02 Quality Assessment
In the previous phase, we identified the user's requirements for the dataset and the particular use case the user has in mind. This second phase involves the actual quality assessment based on the requirements. In particular, amongst the set of dimensions and metrics discussed in D3.2, the most relevant ones are selected. Thereafter, an evaluation of the quality of the dataset is performed using the metrics specific for each selected dimension. Thus, this task consists of three steps: (1) Statistical and Low-level analysis, (2) Metadata assessment and (3) Qualitative and quantitative quality assessment.
CA01A02-01 Statistical and Low-level analysis
This step performs basic statistical and low-level analysis on the dataset. That is, generic statistics that can be calculated automatically are included in this step. For example, the number of blank nodes (pointing towards the completeness of the dataset) or the number of interlinks between datasets (showcasing the interlinking degree of the dataset) is calculated. After the analysis, generic statistics on the dataset based on certain pre-defined heuristics are calculated and provided to the user. The end result is a score indicating the value for each of the metrics assessed.
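An illustrative low-level analysis with the rdflib library: counting blank nodes (a coarse completeness indicator) and links pointing outside the dataset's own namespace (a simple interlinking indicator). The file name and the namespace prefix are assumptions.

```python
# Illustrative statistics: triples, blank nodes and links to external URLs.
from rdflib import BNode, Graph, URIRef

graph = Graph()
graph.parse("inspections.ttl", format="turtle")

OWN_NAMESPACE = "http://data.example.org/"  # assumed namespace of the dataset

blank_nodes = {term for triple in graph for term in triple if isinstance(term, BNode)}
external_links = sum(
    1 for _, _, obj in graph
    if isinstance(obj, URIRef) and not str(obj).startswith(OWN_NAMESPACE)
)

print("Triples:", len(graph))
print("Blank nodes:", len(blank_nodes))
print("Links to external URLs:", external_links)
```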
CA01A02-02 Metadata quality assessment
This step performs metadata quality assessment. Metadata play an important role in supporting metrics evaluation since they store aspects of data relevant to data quality. For example, this step checks that temporal metadata, such as the "Creation date" or "Valid Date", are present in the dataset. In general, metadata quality assessment is limited to the assessment of the accuracy and completeness of the metadata. The identification may be done with the help of a checklist, which can be filled in by the user. The checklist for assessing the quality of metadata can be found in Table 1 of D3.2.
CA01A02-03 Qualitative and quantitative quality assessment
This step is composed of two parts: the qualitative and the quantitative assessment. Regarding the qualitative part, the major problems are highlighted by internal administrative users and final users of datasets. The internal administrative users identify the causes of low data quality and further ask internal and final users for perceived and expected quality dimension levels.
This identification may be done with the help of a checklist, which can be filled in by the user. The questions in the checklist implicitly refer to quality problems and their related quality dimensions. For example, questions such as whether the dataset provides a message board or a mailing list (pointing to the understandability dimension) are presented to the user. In this step, the user involvement is entirely manual and the user must have knowledge about the details of the dataset to answer these questions. Using this information, it is then possible to determine a set of relevant dimensions.
Regarding the quantitative part, it is possible to quantitatively assess quality after identifying the relevant DQ dimensions and metrics. The assessment can be performed by checking the fulfilment of the criteria in Table 2 in D3.2, measuring the data quality of datasets and identifying their critical areas.
In order to assess the accuracy of data values, syntax patterns can be specified. The patterns may be defined by users through a proprietary language of tools such as GREL (Google Refine Expression Language) in OpenRefine. These patterns will capture incorrect values of, e.g., postal addresses, phone numbers, email addresses, personal identification numbers, etc. (see the sketch below).
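A minimal Python sketch of such syntax-pattern checks, with regular expressions standing in for the GREL patterns mentioned above; the patterns are deliberately simple illustrations, not production-grade validators.

```python
# Illustrative syntax patterns for accuracy assessment of data values.
import re

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "czech_postal_code": re.compile(r"^\d{3} ?\d{2}$"),
    "phone": re.compile(r"^\+?\d[\d ]{7,14}$"),
}

def invalid_values(values, pattern_name):
    """Return the values that do not match the chosen syntax pattern."""
    pattern = PATTERNS[pattern_name]
    return [value for value in values if not pattern.fullmatch(value)]

print(invalid_values(["110 00", "60200", "1100"], "czech_postal_code"))
# ['1100']
```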
Databases employ functional dependencies to capture accuracy and consistency quality issues. Alternatively, accuracy and consistency can be assessed through a pattern-based approach which generates data quality tests of RDF knowledge bases. The tests are generated based on schema constraints or semi-automatically enriched schemata and allow users to generate specific tests for the dataset. The most common assessment techniques perform the assessment based on reference values, such as previous values from a dataset in the same domain or gold standard values (values from the original source). This step is performed by comparing values from the transformed dataset to the gold standard values (i.e. values from the original source) or to a dataset in the same domain. For example, in the case of measuring the population completeness of a dataset, it needs to be compared with the original dataset.
Measurement can be objective when it is based on quantitative metrics, or subjective, when it is based on qualitative evaluations by data owners and users.
CA01A03 Quality improvement
In the improvement step there are two general types of strategies, namely data-driven and process-driven. Data-driven strategies improve the quality of data by directly modifying the value of data. For example, obsolete data values are updated by refreshing a database with data from a more current database. Process-driven strategies improve quality by redesigning the processes that create or modify data. As an example, a process can be redesigned by including an activity that controls the format of data before storage.
Strategies, both data- and process-driven, apply a variety of techniques: algorithms, heuristics, and knowledge-based activities, whose goal is to improve data quality. A possible set of optimal improvement activities is:
- For each dataset, fix the new target DQ levels that can satisfy internal and external users.
- [Optional] Conceive process re-engineering activities that may lead to an improvement of DQ values evaluated during the quality assessment step, relating them in the process/dataset matrix to clusters of datasets involved in DQ improvement targets.
- Conceive data driven activities such as:
- acquisition of new data, which improves data by acquiring higher-quality data to replace the values that raise quality problems;
- standardization (or normalization), which replaces or complements nonstandard data values with corresponding values that comply with the standard. For example, nicknames are replaced with corresponding names (e.g. Bob with Robert), and abbreviations are replaced with corresponding full names (e.g. Channel Str. with Channel Street) - a small sketch follows after this list;
- record linkage, which identifies data representations in two (or multiple) tables that might refer to the same real-world object;
- error localization and correction, which identify and eliminate data quality errors by detecting the records that do not satisfy a given set of quality rules. These techniques are mainly studied in the statistical domain. Compared to elementary data, aggregate statistical data, such as average, sum, max, and so forth are less sensitive to possibly erroneous probabilistic localization and correction of values. Techniques for error localization and correction have been proposed for inconsistencies, incomplete data, and outliers [Dasu and Johnson 2003]; [Batini and Scannapieco 2006].
- [Optional] add activities deriving from step 2.
- Check that identified activities/techniques/tools cover all DQ targets over all datasets, and in case complete them.
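A small Python sketch of the data-driven standardization activity mentioned in the list above, using the nickname and abbreviation examples from the text; the mapping tables are tiny illustrations only.

```python
# Illustrative standardization of names and addresses.
ABBREVIATIONS = {"Str.": "Street", "Ave.": "Avenue"}
NICKNAMES = {"Bob": "Robert", "Bill": "William"}

def standardize_address(address):
    """Replace known abbreviations with their full forms."""
    for abbreviation, full_form in ABBREVIATIONS.items():
        address = address.replace(abbreviation, full_form)
    return address

def standardize_name(name):
    """Replace known nicknames with the corresponding standard names."""
    return " ".join(NICKNAMES.get(part, part) for part in name.split())

print(standardize_address("Channel Str."))  # Channel Street
print(standardize_name("Bob Smith"))        # Robert Smith
```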
CA02A01 Identification of the potential user groups
Potential user groups might include:
- application developers,
- journalists,
- citizens, individuals,
- public sector bodies,
- non-governmental organizations and civic associations - they might be especially relevant to public sector bodies as publisher, e.g. watchdog organizations,
- other organizations (non-application developers, e.g. banks),
- own employees of the data owner/publisher.
Various techniques might be applied for identification of the potential user groups, for example analysis of the website traffic, analysis of the FOI requests, or workshops and other events where the representatives of the publishers might get in touch with the different types of potential data users.
CA02A02 Definition of the communication strategy
Practices, recommendations and guidelines for definition of the communication strategy:
- Communication between the publisher and users (community) must work both ways, i.e. the communication strategy must cover both the publisher-to-users communication as well as the users-to-publisher communication (user feedback). The feedback loop is essential for the development of the open data initiative.
○ In order to ensure that the users are allowed to provide feedback and that this feedback can be utilized in maintenance of the datasets, at least one electronic channel should be established at the publisher's website through which users can provide feedback about the published datasets. Services provided by a third party might be utilized as well, as long as the publisher provides information on how the users can access the service and provide the feedback.
- It is necessary to set competencies and responsibilities regarding the communication about open data (possibly as a part of the open data publication plan).
- Forms of the communication and the communication channels should be selected with the target user groups in mind.
- Application developers re-using the open datasets (individuals and organizations) represent an important user group. Information about availability of the datasets or about important changes in their availability (newly published datasets, changes in the means of access etc.) should be disseminated through channels commonly watched by the application developers or IT professionals, e.g. well-known portals or discussion fora.
- Channels watched by the open data community (local, national or international) should be used as well.
- On the publisher’s web site or in the data catalogue, information for non-IT professionals should be provided. For this kind of users, it might be difficult to work with large datasets. Therefore, they should be informed that the published datasets are primarily intended for reuse by application developers. If there are suitable documents available that can be easily understood by the non-IT audience, links to these documents should be provided. Links to the applications using the provided datasets might be provided as well.
- The communication strategy should define what channels will be established through which the users will be able to provide feedback or ask questions. It must be ensured that the response will be appropriate and timely.
Communication regarding open data should not be one-way only. Therefore the open data publishers should provide means through which the users and re-users might provide feedback (Open Data Institute, 201?b). According to Open Data Institute (201?b), various contact points, e.g. for general data provision, errors in data or confidentiality concerns, might be established or a forum might be provided. Because the users might report back to the publisher about identified errors in the published data, the user feedback might also support improvement of the dataset quality (Both and Schieferdecker, 2012). According to Janssen and Zuiderwijk (2012), electronic means of communication might lower the potential barriers for the user feedback.
When developing the open data communication strategy, the use of social media might be taken into account (Open Knowledge Foundation, 2012). If a strategy or guidelines for using social media are already in place, open data related communication using social media should be harmonized with the existing guidelines. However, a detailed discussion of how to effectively use social media for building engagement is beyond the scope of this methodology. See for example The Digital Engagement Guide for more information. You can also see (Krabina, Prorok and Lutz, 2012) for more information on how to develop a social media strategy supporting Open Government.
CA02A03 Engaging users during development of the OD publication plan
The goal of this task is to engage users during development of the open data publication plan. The main objective of this interaction is to obtain information about demand for data. This information might be utilized during selection and prioritization of datasets for opening up.
The following techniques might be used to engage users and obtain the information about the data in demand (the provided list is not comprehensive):
- surveys, polls;
- online voting for datasets from a preselected list of candidate datasets or data domains;
- workshops;
- public discussions;
- conferences;
- other types of events.
CA02A04 Setting up the communication channels defined in the communication strategy
On the publisher’s web site or in the data catalogue, a service should be provided through which news about the published datasets will be announced, for example an RSS feed or a mailing list.
CA02A05 Preparation of the communication campaign
Recommendations for the communication campaign:
- A press release should be prepared that will accompany the initial publication of the open dataset.
- Based on the analysis of the potential risk of misinterpretation of the published data an appropriate response should be prepared.
- If there are known data quality issues in the datasets that will not be fixed before the datasets are published, informing about these quality issues should be included in the communication campaign. It should also be explained why the data is published despite the quality issues (e.g. because it is not possible to easily correct the incorrect data, or because the publisher wants to continuously improve the quality of the data based on the feedback provided by the users, etc.).
CA02A06 Informing about progress
The goal of this task is to inform users and the public in general about important achievements and progress in the preparation of datasets for publication. This task should be executed especially when the preparations will take a significant amount of time. Because at this stage no open datasets might be available yet, from the outside it might seem that the open data initiative is on hold. Therefore this task helps to keep users informed that the open data initiative is still ongoing.
When informing users about the progress general rules set by the communication strategy should be applied. During the task CA02A05 some form of a communication campaign might be prepared for this purpose.
Try to focus on important achievements that are likely to attract interest. Events like conferences, hackathons, meetups or open data application contests, even if organized by someone else, might be a good opportunity to remind the community that you will be publishing open data or new datasets soon.
CA02A07 Informing about open data
Users should be informed especially about:
- newly published or changed datasets,
- planned maintenance and downtimes of the systems that provide access to the published open datasets or on which the data catalogue runs.
Events like workshops with the application developers or application challenges might be organized in order to foster reuse of the published datasets. You should also watch for this kind of event organized by someone else and join some of them if appropriate.
CA02A08 Analysis of the user feedback
User feedback should be regularly evaluated and analysed. User feedback might indicate the demand for data, and therefore it might help in prioritization of the datasets considered for opening up (frequently requested datasets should be ranked with a higher priority). The user feedback might also be a valuable source of information about errors in the data or about the user experience of the data catalogue. Therefore the analysis of the user feedback should be used to improve the quality of the published datasets or to improve other aspects of the open data publication.
User feedback and the demand for data should be regularly evaluated and analysed. Demand for data might be identified for example by:
- analysis of the user feedback,
- analysis of requests pursuant to the Freedom of Information legislation,
- information gained during the workshops with the application developers, application challenges or similar events,
- analysis of open datasets published by the similar organizations,
- studies like the Open Data Barometer and the Open Data Index.
CA02A09 Informing about the termination of maintenance or publication of a dataset
Recommendations regarding informing about the termination of maintenance or publication of a dataset:
- Users should be informed about the upcoming termination of maintenance or publication of the datasets. Users should be informed in advance so that they have enough time to prepare.
- Information about the upcoming termination of maintenance or publication of the dataset should be published visibly on the publisher’s website or in the data catalogue. This information should be also announced through channels and services providing the news about the open datasets.
- If the concrete users of some specific datasets are known they should be informed directly.
- Reason for the termination of maintenance or publication of the datasets should be published as well.
- If there are datasets superseding the ones no longer published or maintained users should be informed about these datasets and where to find them.
Information about termination of maintenance or publication of a dataset might trigger responses from the users. Be prepared to handle them. Feedback obtained from the users should be analysed and based on this analysis improvements to the dataset maintenance/publication termination activities and information campaign might be proposed.
CA03A01 Identification and analysis of the potential risks
In this task risks related to publication of open data are identified, described and analysed. Probability and impact assessment of each of the identified risks are results of the risk analysis. Every identified risk together with the description and the information about its probability and impact should be recorded in the risk register.
The following risks might be relevant to publication of open data ((Kučera & Chlapek, 2014), the provided list is by no means comprehensive):
- Publication of data against the law - Publication of data that violates some legislation, i.e. it is prohibited by law or it infringes someone's rights or freedoms.
- Trade secret protection infringement - Publication of data that reveals some trade secrets that ought to be protected.
- Privacy infringement - Publication of personal data that ought to be protected.
- Risk to the security of the infrastructure - Detailed data about infrastructure (power plants, dams, transmitters etc.) might be misused to cause damage to the infrastructure.
- Publication of improper data or information - Publication of data that does not violate legislation but that might lead to a negative publicity or negative attitude of other public sector bodies.
- Publication of inaccurate data - People and organizations might provide incorrect data to the public sector bodies. As a consequence incorrect OGD might be published if datasets are derived from incorrect primary data.
- Misinterpretation of the data - Published data can be interpreted in different ways. Users might intentionally or unintentionally misinterpret the data (to cause scandal, to get competitive advantage, to cause harm to other subjects etc.)
- Absence of data consumers - There will be no consumers of the data because it will not be possible to locate the dataset or because nobody will find it interesting.
- Subjects less willing to cooperate - Published data about the results of the administrative supervision might bring negative publicity to those who do not comply with the legislation. These subjects might be then less willing to cooperate with the public sector bodies.
- Overlapping of data - Datasets might contain overlapping collections of data. Several datasets on various websites might contain data on the same topic. If these datasets are inconsistent, users might get confused.
- Increased number of requests for data - Increased number of published datasets might lead to an increased number of requests or questions about the published data or some related data.
Detailed recommendations for the risk analysis are provided below.
CA03A01-01 Analyse legislation and contractual obligations
Some of the open data related risks might have the form of violation of legislation or infringement of contractual obligations or rights. Therefore relevant legislation as well as internal directives should be identified for every analysed dataset because they might constrain how the data can be published. Contractual obligations constraining the publication should be analysed as well, especially in situations when the data was collected or processed by a third party.
CA03A01-02 Anonymization of the datasets
Publication of a dataset as open data might be constrained if the dataset contains protected data such as personal data or trade secrets. In such situations, anonymization of the dataset should be considered.
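Purely as an illustration of the technical side of such a step (the column names, file layout and the salted-hash pseudonymisation are assumptions; whether pseudonymisation constitutes sufficient anonymization must be assessed against the applicable data protection legislation), direct identifiers might be removed and quasi-identifiers replaced before publication:

    import csv
    import hashlib

    # Hypothetical column names assumed to contain personal data.
    DROP_COLUMNS = {"name", "address"}            # direct identifiers: removed entirely
    PSEUDONYMIZE_COLUMNS = {"national_id"}        # replaced by salted hashes
    SALT = "keep-this-salt-secret"                # prevents trivial dictionary attacks

    def anonymize(in_path: str, out_path: str) -> None:
        """Write a copy of the CSV with identifiers dropped or pseudonymized."""
        with open(in_path, newline="", encoding="utf-8") as src, \
             open(out_path, "w", newline="", encoding="utf-8") as dst:
            reader = csv.DictReader(src)
            kept = [c for c in reader.fieldnames if c not in DROP_COLUMNS]
            writer = csv.DictWriter(dst, fieldnames=kept)
            writer.writeheader()
            for row in reader:
                out = {c: row[c] for c in kept}
                for c in PSEUDONYMIZE_COLUMNS.intersection(kept):
                    out[c] = hashlib.sha256((SALT + row[c]).encode("utf-8")).hexdigest()
                writer.writerow(out)

Note that removing or hashing columns is only one building block; the risk of re-identification from combinations of the remaining attributes should be assessed as well.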
CA03A01-03 Mitigation of the misinterpretation risk
Risk of misinterpretation should be analysed for every dataset. This risk might result from the complexity of the data or from ignorance of its context. It might be mitigated by publication of additional information alongside the dataset that will help users to better understand what the data is about, what its limitations are, what methodology was used, etc. Therefore it should be analysed what additional information should be provided for each candidate dataset for opening up. This information might be provided as a part of the catalogue record, or the record might provide links to this information.
Mitigation of the risk of misinterpretation of the data should be performed in coordination with the Description of the datasets task (P02A03). An accurate description of the datasets might help to mitigate this risk. Alongside the links to the data sources (distributions), the catalogue record might contain links to documentation of the dataset. The above-mentioned information explaining how to properly interpret the data might be provided as a part of this documentation. Therefore cataloguing of the open data might play an important role in mitigation of the misinterpretation risk.
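For illustration only (the dataset, URLs and field names below are hypothetical; the property names loosely mirror common catalogue vocabularies such as DCAT but are not prescribed by this methodology), a catalogue record carrying both the distribution links and the interpretation guidance might look as follows:

    # Hypothetical catalogue record; dataset, URLs and field names are illustrative only.
    catalogue_record = {
        "title": "Register of issued permits",
        "description": "Permits issued by the organization, updated monthly.",
        "distributions": [
            {"format": "CSV", "downloadURL": "https://data.example.org/permits.csv"},
        ],
        "documentation": "https://data.example.org/permits-methodology.html",
        "notes": "Figures before 2012 were collected under a different methodology "
                 "and are not directly comparable with later years.",
    }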
CA03A02 Definition of the risk mitigation plan
It is necessary to set the competencies and responsibilities for the risk management and realization of the planned risk mitigation actions. These competencies and responsibilities might be set by the open data publication plan if the risk mitigation plan is a part of the open data publication plan.
For each risk in the risk register it is necessary to determine whether the risk should be avoided, mitigated or transferred, e.g. by insurance. Accepting the risk is another option – the loss is accepted if it occurs. The risk response should be cost effective, i.e. any actions taken to avoid or mitigate the risk should not cost more than the expected loss resulting from the impact and probability of the risk. Therefore the expected impact and probability of the risk should be taken into account when deciding which risk response strategy will be taken.
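To illustrate this cost-effectiveness reasoning (the figures below are invented solely for the example), the expected loss implied by the probability and impact of a risk can be compared with the cost of the planned response:

    def expected_loss(probability: float, impact_cost: float) -> float:
        """Expected loss = probability of occurrence x estimated cost of the impact."""
        return probability * impact_cost

    # Hypothetical figures: a 10% yearly probability of a loss of 50 000 EUR.
    loss = expected_loss(0.10, 50_000)        # 5 000 EUR per year
    mitigation_cost = 8_000                   # yearly cost of the proposed mitigation action
    cost_effective = mitigation_cost <= loss  # False here, so accepting or transferring
                                              # the risk may be preferable

In practice, estimates of probability and impact are often rough, so such a comparison should be treated as an aid to judgement rather than a mechanical rule.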
Definition of the risk mitigation plan should be coordinated with definition of the benefits management plan (CA04A02) as well as the open data publication plan (P01A05) – if the benefits management plan and the risk mitigation plan are not parts of the open data publication plan. Benefits, risks and publication effort should be balanced. In order to balance these factors it is necessary to analyse benefits, risks and effort as well. Therefore these plans should not be defined separately from each other; their definition should be aligned and harmonized.
CA03A03 Update of the risk register with the information acquired during the preparation of the datasets
Identified risks and their evaluation should be updated according to the information acquired during the preparation of the datasets.
CA03A04 Realization of the risk mitigation actions relevant to the preparation of the datasets
Relevant risk mitigation actions should be performed. For example, the approach to anonymization of the datasets containing protected data should be determined. During the description of the datasets, additional information needed for correct understanding of the datasets should be collected as well.
CA03A05 Risk management (Realization of publication phase)
During the realization of publication phase it is necessary to manage the open data related risks. Maintenance of the risk register is one of the key activities in this phase. This might involve, for example, adding new risks and their evaluation, correcting the evaluation of the risks already identified or updating the risk mitigation actions. The risk mitigation actions should be performed according to the risk mitigation plan.
CA03A06 Responding to events (Realization of publication phase)
Identified risks must be monitored. In case of risk related events, activities limiting the magnitude of the impact are performed according to the risk mitigation plan.
CA03A07 Reporting about the state of the open data related risks (Realization of publication phase)
During the realization of publication phase risk reporting should be performed towards the relevant stakeholders, especially the open data coordinator and the owner and publisher of the data.
CA03A08 Risk management (Archiving phase)
Risk register should be updated. Risks related to the datasets whose maintenance or publication was terminated should be newly evaluated. Value of the datasets which are no longer maintained might decrease and thus some of the risk mitigation actions might no longer be cost effective. It should also be analysed whether the reason for termination of the maintenance or publication of a dataset results in new risks.
Unmaintained but still available datasets should be marked as unmaintained in the catalogue in order to minimize the risk that they will be misinterpreted due to the unclear status of the dataset.
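Re-using the illustrative catalogue record from the sketch under CA03A01-03 (field names and values are again assumptions only), the unmaintained status might be recorded, for example, as:

    # Flag the hypothetical record as unmaintained and, if applicable,
    # point users to a successor dataset.
    catalogue_record["status"] = "unmaintained"
    catalogue_record["maintenance_terminated"] = "2015-06-30"                  # illustrative date
    catalogue_record["superseded_by"] = "https://data.example.org/permits-v2"  # if a successor exists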
Termination of maintenance or publication of datasets might result in negative responses from the users. Therefore it is necessary to inform the users about the change in availability of the datasets and to explain why the maintenance or publication of the datasets is terminated. Risk management should be aligned with the activities of the communication management cross-cutting activity (CA02) in order to ensure that the users are properly informed.
CA03A09 Responding to events (Archiving phase)
Identified risks must be monitored. In case of risk related events, activities limiting the magnitude of the impact are performed according to the risk mitigation plan.
CA03A10 Reporting about the state of the open data related risks (Archiving phase)
Relevant stakeholders, especially the open data coordinator and the owner and publisher of the data, should be informed about the current risks and about the changes in the risk register that were made as a result of the termination of maintenance or publication of datasets.
CA04A01 Identification and analysis of the potential benefits
During the identification and analysis of the potential benefits the following steps should be performed:
- identification of the expected benefits of open data publication,
- determination of what datasets must be published in order to achieve the identified benefits,
- quantification of the benefits if possible,
- qualitative description of the benefit if it is not possible to quantify its amount,
- definition of metrics and indicators for measuring the benefits.
If it is possible to quantify a benefit, then the quantification should be performed. The amount of the benefit should be expressed in financial terms. This allows comparison of benefits and risks and helps to align open data related risk management and benefits management. However, if it is not possible to quantify a benefit, it should at least be described in a qualitative way.
Even though the list is not comprehensive, the following benefits can be achieved by open data initiatives (Kučera & Chlapek, 2014):
- Increased transparency
- Improved public relations and attitudes toward government
- Increased reputation of a public sector body
- Transparent way of informing the general public about infringement of legislation
- Improved government services
- Improved government data and processes
- Better understanding and management of data within public sector bodies
- Supporting reuse
- Increasing value of the data
- Stimulating economic growth
- Minimizing errors when working with government data
- Easier translations
- Fewer requests for data
Identified benefits are registered in the benefits register. This register also contains description of the benefits, their evaluation and identification of the related datasets.
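The methodology does not prescribe a particular format for the benefits register either. Purely as an illustration (the field names, the metric and the figures are our assumptions), an entry might be captured along the following lines:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class BenefitRegisterEntry:
        # Illustrative structure only; field names, metric and figures are assumptions.
        benefit_id: str
        title: str
        description: str
        related_datasets: List[str]
        quantified_value_eur: Optional[float] = None   # filled in only when quantification is possible
        metric: str = ""                               # indicator used to measure the benefit

    entry = BenefitRegisterEntry(
        benefit_id="B-07",
        title="Fewer requests for data",
        description="Published datasets reduce the number of individual information requests.",
        related_datasets=["register-of-permits"],      # hypothetical dataset identifier
        quantified_value_eur=4_000,                    # e.g. estimated yearly handling cost saved
        metric="number of information requests per year",
    )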
CA04A02 Definition of the benefits management plan
It is necessary to set the competencies and responsibilities for the open data benefits management. These competencies and responsibilities might be set by the open data publication plan if the benefits management plan is a part of the open data publication plan.
Results of the identification and analysis of benefits of the open data publication (CA04A01) are utilized in development of the benefits management plan. Actions aimed at creation of the prerequisites for achievement of the identified benefits should be defined. A schedule for realization of the defined actions should be set, as well as the responsibilities for their implementation. The review process for evaluation of the benefits and the review periodicity should be set in the plan as well.
Definition of the benefits management plan should be coordinated with definition of the risk mitigation plan (CA03A02) as well as the open data publication plan (P01A05) – if the benefits management plan and the risk mitigation plan are not parts of the open data publication plan. Benefits, risks and publication effort should be balanced. In order to balance these factors it is necessary to analyse benefits, risks and effort as well. Therefore these plans should not be defined separately from each other; their definition should be aligned and harmonized.
CA04A03 Update of the benefits management plan according to the information acquired during the preparation of the datasets
During the preparation of publication phase datasets are described in more detail. As a consequence new information relevant to the open data benefits management might be obtained. Identified benefits and their evaluation should be updated according to the information acquired during the preparation of the datasets.
CA04A04 Benefits monitoring and management (Realization of publication phase)
Activities of this task ensure that the planned actions aimed at creation of the prerequisites for achievement of the defined benefits are implemented and that the identified benefits are continuously monitored and managed. Benefits register and the benefits management plan should be regularly updated. New actions supporting achievement of the benefits might be planned and subsequently implemented during this phase.
Management of benefits in this phase should be coordinated with activities of the Communication management cross-cutting activity (CA02). User feedback is acquired and analysed during the communication activities. This feedback might provide information significant for the assessment of the actual benefits realization. Seminars and workshops with application developers or open data application contests were mentioned as a way in which an open data publisher might interact with the users (CA02A07). Such events might serve not only as a way of communication and interaction but might also foster achievement of the expected open data benefits, e.g. the supporting reuse benefit. Therefore such actions might be defined in the communication strategy as well as in the benefits management plan.
CA04A05 Reporting about the open data related benefits (Realization of publication phase)
During the realization of publication phase reporting about benefits and their achievement should be performed towards the relevant stakeholders, especially the open data coordinator and the owner and publisher of the data.
CA04A06 Benefits monitoring and management (Archiving phase)
During the archiving phase benefits monitoring and management should focus on those benefits that are or could be affected by the termination of maintenance or publication of datasets, especially on those benefits that are dependent on publication of some particular dataset or datasets. When a decision to terminate maintenance/publication of a dataset is made, the benefits achieved by publication of this dataset should be evaluated, if it is possible to assign some amount of benefits to a particular dataset. Impact of the decision to terminate maintenance/publication of datasets on the benefits should be evaluated, and the benefits register as well as the benefits management plan should be updated.
CA04A07 Reporting about the open data related benefits (Archiving phase)
Relevant stakeholders, especially the open data coordinator and the owner and publisher of the data, should be informed about the current state of benefits and about the changes in the benefits register and the benefits management plan that were made as a result of the termination of maintenance or publication of datasets.
References
- BATINI, Carlo, CAPPIELLO, Cinzia, FRANCALANCI, Chiara, MAURINO, Andrea, 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41(3).
- BATINI, Carlo, SCANNAPIECO, Monica, 2006. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications, Springer 2006, ISBN 978-3-540-33172-8.
- BOTH, Wolfgang and SCHIEFERDECKER, Ina, 2012. Berliner Open Data-Strategie. Organisatorische, rechtliche und technische Aspekte offener Daten in Berlin [Berlin Open Data Strategy. Organisational, legal and technical aspects of open data in Berlin]. Berlin: Fraunhofer Verlag, 2012. 172 pp. ISBN 978-3-8396-0368-0.
- Cabinet Office, 2013. National Information Infrastructure. In: data.gov.uk [online]. October 2013. [cit. 2014-03-11]. Available from: http://data.gov.uk/sites/default/files/library/20131112%20NII%20Narrative%20for%20publication%20FINAL.pdf
- Czech Republic, 2012. Akční plán České republiky „Partnerství pro otevřené vládnutí“ [Action Plan of the Czech Republic “Open Government Partnership”]. In: Vládní výbor pro koordinaci boje s korupcí [online]. 4th April 2012 [cit. 2014-07-09]. Available from: http://www.korupce.cz/assets/partnerstvi-pro-otevrene-vladnuti/Akcni-plan-OGP.pdf
- DULONG DE ROSNAY, Mélanie, TSIAVOS, Prodromos, ARTUSIO, Claudio, ELLI, Jo, RICOLFI, Marco, SAPPA, Cristiana, VOLLMER, Timothy, TARKOWSKI, Alek, 2014. D5.2. Licensing Guidelines. In: LAPSI 2.0 [online]. 25th February 2014 [cit. 2014-07-09]. Available from: http://www.lapsi-project.eu/sites/lapsi-project.eu/files/D5.2LicensingGuidelinesPO.pdf
- EIBL, Gregor, HÖCHTL, Johann, LUTZ, Brigitte, PARYCEK, Peter, PAWEL, Stefan, PIRKER, Harald, 2013. Framework for Open Government Data platforms. In: data.gv.at [online]. 28.3.2013 [cit. 2014-06-15]. Available from: http://www.data.gv.at/wp-content/uploads/2013/08/Framework__for_Open_Government_Data_Platforms_1.1.pdf
- ENGLISH, Larry P., 1999. Improving Data Warehouse and Business Information Quality. John Wiley & Sons.
- Executive Office of the President, 2013. Open Data Policy – Managing Information as an Asset. In: The White House [online]. 9th May 2013. [cit. 2013-11-06]. Available from: http://www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf
- JANSSEN, Marijn and ZUIDERWIJK, Anneke, 2012. Open data and transformational government. Proceedings of the Transforming Government Workshop 2012 (tGov2012). May 8th – 9th 2012, Brunel University, London, United Kingdom.
- JEUSFELD, Manfred, QUIX, Christoph, JARKE, Matthias, 1998. Design and analysis of quality information for data warehouses. In Proceedings of the 17th International Conference on Conceptual Modeling.
- KLESSMANN, Jens, DENKER, Philipp, SCHIEFERDECKER, Ina, SCHULTZ, Sönke E., 2012. Open Government Data Germany. Short Version of the Study on Open Government in Germany Commissioned by the Federal Ministry of the Interior. In: Federal Ministry of the Interior [online]. July 2012 [cit. 2014-07-07]. Available from: http://www.bmi.bund.de/SharedDocs/Downloads/DE/Themen/OED_Verwaltung/ModerneVerwaltung/opengovernment_kurzfassung_en.pdf
- KRABINA, Bernhard, PROROK, Thomas, LUTZ, Brigitte, 2012. Open Government Implementation Model. Version 2.0. In: KDZ - Zentrum für Verwaltungsforschung [online]. 30 November 2012 [cit. 2014-07-25]. Available from: http://www.kdz.eu/sites/default/files/documents/kdz/news/Open%20Government%20Implementation%20Model%20KDZ%20V2.0-final.pdf
- KUČERA, Jan, CHLAPEK, Dušan, 2014. Benefits and Risks of Open Government Data. Journal of Systems Integration [online], 2014, vol. 5, no. 1, pp. 30–41. ISSN 1804-2724. Available from: http://www.si-journal.org/index.php/JSI/article/viewFile/185/136
- KUČERA, Jan, 2014. Analýza a výběr dat k uveřejnění. Tutoriál “Otevírání a propojování dat” [Analysis and selection of datasets for publication. Tutorial “Opening up and interlinking of data”]. In: Česká společnost pro systémovou integraci [online]. 6.6.2014 [cit. 2014-07-05]. Available from: http://www.cssi.cz/cssi/system/files/all/Seminar_CSSI_6-6-2014_tutorial%20dopoledne.zip
- Logica Business Consulting, 2012. Open data and use of standards: Towards a Better Supply and Distribution Process for Open Data. In: Standardization Forum [online]. 2012 [cit. 2014-06-20]. Available from: http://www.forumstandaardisatie.nl/fileadmin/os/documenten/Internationale_benchmark_v1_03_final.pdf
- LOSHIN, David, 2004. Enterprise Knowledge Management - The Data Quality Approach. Series in Data Management Systems, Morgan Kaufmann, chapter 4.
- NEČASKÝ, Martin, KUČERA, Jan, CHLAPEK, Dušan, HANEČÁK, Peter, LACHMANN, Gabriel, RULA, Anisa, BEŇO, Peter, 2014. Deliverable D2.2: Criteria for the selection of datasets. In: Components Supporting the Open Data Exploitation [online]. 24th February 2014 [cit. 2014-07-06]. Available from: http://www.comsode.eu/wp-content/uploads/COMSODE_deliverable_2.2_final.pdf
- Open Knowledge Foundation, 2012. The Open Data Handbook. In: Open Data Handbook [online]. [cit. 2014-01-25]. Available from: http://opendatahandbook.org/
- Open Data Institute, 201?a. Placr. In: Open Data Institute [online]. [cit. 2014-07-07]. Available from: http://theodi.org/case-studies/placr-case-study
- Open Data Institute, 201?b. Engaging with reusers. In: Open Data Institute [online]. [cit. 2014-07-28]. Available from: http://theodi.org/guides/engaging-reusers
- Slovak Republic, 2012. Open Government Partnership Action Plan of the Slovak Republic. In: Open Government Partnership [online]. February 22, 2012 [cit. 2014-07-09]. Available from: http://www.opengovpartnership.org/file/853/download
- DASU, Tamraparni, JOHNSON, Theodore, 2003. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003. ISBN 0-471-26851-8.
[5] Let us note that there is a new W3C standard being developed for publishing CSV data on the Web. This standard will standardize constructs for recording URLs in CSV documents published on the Web. However, it is only in the initial phases of development and it cannot be recommended by this methodology yet. For more details about the development of the standard, see the wiki page of the respective W3C working group: http://www.w3.org/2013/csvw/wiki/Main_Page
[13] http://w3c.github.io/csvw/metadata/. Let us note that this is a standard in development by the W3C CSV on the Web Working Group. Therefore, changes to this standard may appear before it is finalized, and we recommend monitoring its development before using it.
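Purely for orientation, and subject to the caveat above that the draft may still change, a metadata document of the kind the draft describes has roughly the following shape (shown here as an equivalent Python dictionary; the file and column names are hypothetical):

    # Rough shape of a CSV on the Web metadata document as described by the draft
    # at the time of writing; property names may change before finalization.
    csvw_metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "permits.csv",                   # the CSV file being described (hypothetical)
        "tableSchema": {
            "columns": [
                {"name": "permit_id", "datatype": "string"},
                {"name": "issued", "datatype": "date"},
            ],
        },
    }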
[20] Although there might be restrictions on licensing a dataset containing personal data due to data protection legislation, it is still possible to consider creating a derivative dataset which would not contain data protected by data protection legislation (anonymization) or obtaining free consent from the data subjects.