
Terms and definitions used in the project concerning Open Data publication.

You may want to open this page at the beginning of your work and keep it open until you are familiar with its content.

If you are disoriented by an abbreviation, have a look right here; otherwise you will also find it in the alphabetical list of terms.



CSV - Comma separated values

DPU - Data Processing Unit: a component in UnifiedViews that can execute a transformation of data

DCAT - Data Catalog Vocabulary

ETL - extract, transform, and load: a process that extracts data from its sources, transforms it, and loads it into a target system

ODN - Open Data Node

OWL - Web Ontology Language

RDF - Resource Description Framework

SPARQL - SPARQL Protocol and RDF Query Language

URI - Uniform resource identifier

UnifiedViews - the ETL tool used in ODN

UnifiedViews core components

UnifiedViews plugin components (DPUs)


1*-5* schema for data formats

1* Available on the web (whatever format), but with an open licence, to be Open Data

2* Available as machine-readable structured data (e.g. Excel instead of an image scan of a table): formats of tabular editors (ODS, XLS(X), etc.) or HTML; for textual documents ODT, DOC(X), etc.

3* As (2), plus a non-proprietary format (e.g. CSV, XML or JSON instead of Excel)

4* All the above, plus use of URIs and open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff

5* All the above, plus: link your data to other people’s data to provide context - use URIs of the other related data and datasets


Anonymisation

Processing data that includes personal information so that individuals can no longer be identified in the resulting data. Anonymisation enables data to be published without breaching data protection principles. The principal techniques are aggregation and de-identification. These techniques minimise the risk of data leakage that would result in individuals' privacy being compromised.


Anonymization

See Anonymisation.


API

Application Programming Interface: a set of definitions of the ways one piece of computer software communicates with another. It is a method of achieving abstraction, usually (but not necessarily) between higher-level and lower-level software. For data, this is usually a way provided by the data publisher for programs or apps to read data directly over the web. The app sends the API a query asking for the specific data it needs, e.g. the time of the next bus leaving a particular stop. This allows the app to use the data without downloading the whole dataset, saving bandwidth and ensuring that the data used is the most up-to-date available.
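As a sketch of the bus-stop example above, a client might request just the next departure rather than the whole timetable; the endpoint, base URL and parameter names below are hypothetical, assuming a JSON API:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://transit.example.org/api/v1"  # hypothetical endpoint

def departures_url(stop_id, limit=1):
    """Build a query URL asking only for the next departures at one stop."""
    return f"{API_BASE}/departures?{urlencode({'stop': stop_id, 'limit': limit})}"

def next_bus(stop_id):
    """Fetch just the data the app needs, not the whole timetable."""
    with urlopen(departures_url(stop_id)) as response:
        return json.load(response)
```

Because only one departure is requested, the app saves bandwidth and always sees the publisher's latest data.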

API Analytics

Rate limiting will be part of any API platform; without some sort of usage log and analytics showing developers where they stand, rate limits will cause nothing but frustration. Clearly show developers where they stand with daily, weekly or monthly API usage, and provide proper relief valves allowing them to scale their usage properly.

API Documentation

Quality API documentation is the gateway to a successful API. API documentation needs to be complete yet simple, a very difficult balance to achieve. Striking this balance takes work, and will require more than one individual on an API development team.

API documentation can be written by the developers of the API, but additional edits should be made by developers who were not responsible for deploying it. As the original developer, it is easy to overlook parameters and other details about which you have made assumptions.

App / Application

A piece of software (short for 'application'), especially one designed to run on the [web](/glossary/en/terms/web/) or on mobile phones and similar platforms. Apps can make network connections to large databases and thus be a powerful way of consuming open data, which may be real-time, personalised, and (using a mobile phone's GPS) location-specific information.

Application Library

Complete, functioning applications built on an API are the end goal of any API owner. Make sure to showcase all applications built on an API using an application showcase or directory. App showcases are a great way to feature not just applications built by the API owner, but also the successful integrations of ecosystem partners and individual developers.

Application Programming Interface

See API.

Attribution

Acknowledging the source of data when using or re-publishing it. A data licence permitting the data to be used may include a requirement to attribute the source. Data subject to this restriction may still be considered open data according to the Open Definition.


Bandwidth

The rate at which data can be transferred between computers. As bandwidth is limited, [apps](/glossary/en/terms/app-application) aim to download only the minimum amount of data needed to fulfil a user's request.

Basic Auth

Basic Auth is a way for a web browser or application to provide credentials in the form of a username and password. Because Basic Auth is integrated into the HTTP protocol, it is the easiest way for users to authenticate with a RESTful API.

Basic Auth is easily integrated; however, if SSL/TLS is not used, the username and password are passed in plain text and can easily be intercepted on the open Internet.
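For illustration, the Basic Auth credential is just `username:password` base64-encoded into an `Authorization` header; a minimal Python sketch (HTTP client libraries normally build this header for you):

```python
import base64

def basic_auth_header(username, password):
    """Build the HTTP Basic Auth header: 'Basic ' + base64('user:pass')."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# The credentials are only base64-encoded, not encrypted: without SSL/TLS
# anyone on the network can decode them again.
headers = basic_auth_header("alice", "s3cret")
```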

Big Data

A collection of data so large that it cannot be stored, transmitted or processed by traditional means. The increasing availability of and need to process such datasets (for example, huge collections of weather or other scientific data) has led to the development of specialised computer technologies, architectures and programming languages.


BitTorrent

BitTorrent is a protocol for distributing the bandwidth for transferring very large files between the computers participating in the transfer. Rather than downloading a file from a specific source, BitTorrent allows peers to download from each other.


Bulk

Data is available in bulk if the entire dataset can be downloaded easily and efficiently to a user’s own system. Conversely, it is non-bulk if one is limited to getting small parts of the dataset at a time: for example, if you are restricted to a few elements of the data per request and therefore require thousands or millions of requests to get the entire dataset. The provision of bulk access is a requirement of open data.


Catalog

A catalog is a collection of datasets or web services.

Citizen Engagement

Actively involving the public in policy and decision-making. Citizen engagement is a central aim of open government, with the aims of improving decision making and gaining or retaining citizens’ consent and support. Open data is an essential tool for ensuring informed engagement.

Civic Hacking

Building tools and communities, usually online, that address particular civic or social problems. Examples could be tools that help users meet like-minded people locally based on particular interests, report broken infrastructure to their local council, or collaborate to clear litter from their neighbourhood. Local-level open data is particularly useful for civic hacking projects.


CKAN

An open-source software platform for creating data portals, built and maintained by Open Knowledge. CKAN is used as the official data-publishing platform of around 20 national governments and powers many more local, community, scientific and other data portals. Notable features are configurable metadata, user-friendly web interface for publishers and data users, data preview, organisation-based authorisation levels, and APIs giving access to all features as well as data access.
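As a sketch of programmatic access, CKAN's Action API exposes calls such as `package_search` over plain HTTP; the portal URL and query used below are only examples:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def package_search_url(portal, query, rows=5):
    """Build a CKAN Action API search URL for a portal such as 'https://demo.ckan.org'."""
    return f"{portal}/api/3/action/package_search?{urlencode({'q': query, 'rows': rows})}"

def search_datasets(portal, query):
    """Return the titles of datasets matching a free-text query on a CKAN portal."""
    with urlopen(package_search_url(portal, query)) as response:
        body = json.load(response)  # CKAN wraps results as {'success': ..., 'result': ...}
    return [pkg["title"] for pkg in body["result"]["results"]]
```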


Cloud storage

Data stored ‘in the cloud’ is handled by a hosting company, relieving the data owner of the need to manage its physical storage. Instead of being stored on a single machine, it may be stored across or moved between multiple machines in different locations, but the data owner and users do not need to know the details. The hosting company is responsible for keeping it available and accessible via the internet.

Code Library

Working code samples in all the top programming languages are commonplace in the most successful APIs. Documentation will describe in a general way how to use an API, but code samples will speak in the specific language of developers.


Connectivity

Connectivity relates to the ability of communities to connect to the Internet, especially the World Wide Web.

Content API

A web service that provides dynamic access to the page content of a website, including the title, body, and body elements of individual pages. Such an API often, but not always, functions atop a Content Management System.


Conversion

The process of automatically reading data in one file format and emitting the same data in a different format, thus making the data accessible to a wider range of applications.

Copyright

A legal right over intellectual property (e.g. a book) belonging to the creator of the work. While individual data (facts) cannot be copyright, a database will in general be covered by copyright protecting the selection and arrangement of data within it. Within the European Union a separate ‘database right’ protects a database where there was a substantial effort in ‘obtaining’ the data. A copyright holder may use a licence to grant other people rights in the protected material, perhaps subject to specified restrictions.

Cost recovery

The principle of setting a price for a resource, e.g. data, aiming to recover the cost of collecting the data, as distinct from marginal cost. Data charged for on a cost-recovery basis is not open data according to the Open Definition. Studies show that charging for PSI on a cost-recovery basis leads to lower growth than free or marginal-cost pricing.

Creative Commons

A non-profit organisation founded in 2001 that promotes re-usable content by publishing a number of standard licences, some of them open (though others include a non-commercial clause), that can be used to release content for re-use, together with clear explanations of their meaning.


Crowdsourcing

Dividing the work of collecting a substantial amount of data into small tasks that can be undertaken by volunteers. Some examples: Wikipedia is a crowd-sourced encyclopedia, and Galaxy Zoo was an early example of crowdsourcing scientific data, asking non-expert volunteers to classify galaxies based on their visual appearance. NOVAM was a service which allowed the public to verify or correct official data on the locations of UK bus stops, crowdsourcing about 18,000 corrections.


CSV

A comma separated values (CSV) file is a computer data file used for implementing the tried and true organizational tool, the Comma Separated List. The CSV file is used for the digital storage of data structured in a table-of-lists form. Each line in the CSV file corresponds to a row in the table. Within a line, fields are separated by commas, and each field belongs to one table column. CSV files are often used for moving tabular data between two different computer programs (like moving between a database program and a spreadsheet program).

Data is represented in a plain text file, with each data row on a new line and commas separating the values on each row. As a very simple open format it is easy to consume and is widely used for publishing open data.
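A minimal sketch of reading CSV with Python's standard `csv` module (the city names and figures are illustrative, not real statistics):

```python
import csv
import io

# Illustrative CSV content: a header row, then one row per record.
raw = "city,population\nBratislava,425000\nKošice,240000\n"

rows = list(csv.DictReader(io.StringIO(raw)))
# Every field is read as a string, so numbers must be converted explicitly.
populations = {row["city"]: int(row["population"]) for row in rows}
```

In practice the file would be opened with `open("file.csv", newline="")` instead of an in-memory string.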


CSW

Catalog Service for the Web (CSW) is an API used by geospatial systems to provide metadata in open standards, including in the FGDC-endorsed ISO 19115 schema. The CSW-provided metadata can be mapped into the Project Open Data metadata schema.


Data

A value or set of values representing a specific concept or concepts. Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured, analyzed, possibly combined with other data to extract meaning, and presented so as to be useful and relevant for a particular purpose, it becomes information for human apprehension. The meaning of data can vary depending on its context. Data includes, but is not limited to: 1) geospatial data, 2) unstructured data, 3) structured data, etc.

Data Access Protocol

A system that allows outsiders to be granted access to databases without overloading either system.

Data Asset

A collection of data elements or datasets that make sense to group together. Each community of interest identifies the Data Assets specific to supporting the needs of their respective mission or business functions. Notably, a Data Asset is a deliberately abstract concept. A given Data Asset may represent an entire database consisting of multiple distinct entity classes, or may represent a single entity class.

Data cleaning

Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration), etc. See data quality.

Data collection

Datasets are created by collecting data in different ways: from manual or automatic measurements (e.g. weather data), surveys (census data), records of decisions (budget data) or ongoing transactions (spending data), aggregation of many records (crime data), mathematical modelling (population projections), etc.

Data format

See 1*-5* schema for data formats and Openness of the data

Data integration

Almost any interesting use of data will combine data from different sources. To do this it is necessary to ensure that the different datasets are compatible: they must use the same names for the same objects, the same units or co-ordinates, etc. If the data quality is good this process of data integration may be straightforward but if not it is likely to be arduous. A key aim of linked data is to make data integration fully or nearly fully automatic. Non-open data is a barrier to data integration, as obtaining the data and establishing the necessary permission to use it is time-consuming and must be done afresh for each dataset.

Data journalism

The ability to work with data is an increasingly important part of a journalist's armoury. Skills needed to research and tell a good data-based story include finding relevant data, data cleaning, exploring or mining the data to understand what story it is telling, and creating good visualisations.

Data leakage

If personal data has been imperfectly anonymised, it may be possible by piecing it together (perhaps with data available from other sources) to reconstruct the identity of some data subjects together with personal data about them. The personal data, which should not have been published (see [data protection](/glossary/en/terms/data-protection-legislation) ), may be said to have 'leaked' from the 'anonymised' data. Other kinds of confidential data can also be subject to leakage by, for example, poor data security measures. See de-identification.

Data management

The policies, procedures, and technical choices used to handle data through its entire lifecycle from data collection to storage, preservation and use. A data management policy should take account of the needs of data quality, availability, data protection, data preservation, etc.

Data page

A hub for data discovery which provides a common location that lists and links to an organization’s datasets.

Data portal

A web platform for publishing data. The aim of a data portal is to provide a data catalogue, making data not only available but discoverable for data users, while offering a convenient publishing workflow for publishing organisations. Typical features are web interfaces for publishing and for searching and browsing the catalogue, machine interfaces (APIs) to enable automatic publishing from other systems, and data preview and visualisation.

Data preservation

The Domesday Book of 1086 was written with ink on vellum, a technology that is still legible today. Long-term preservation of present day datasets is more difficult to ensure owing to uncertainty about the future of file formats, computer architectures, storage media and network connectivity. Projects that put particular stress on data preservation take a variety of approaches to solving these problems.

Data Processing Unit

Basic element of UnifiedViews, the ETL (extract - transform - load) tool for ODN. Each DPU executes one ETL task; DPUs are assembled into a series called a pipeline. See also pipeline.
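The DPU-and-pipeline idea can be sketched conceptually as chained processing steps; this is not the actual UnifiedViews API, just an illustration with invented step names and data:

```python
# Each step plays the role of a DPU: it takes data and returns transformed data.
def extract():
    return ["  Bratislava ", "Košice", ""]          # e.g. raw scraped values

def transform(rows):
    return [r.strip() for r in rows if r.strip()]   # clean up and drop empties

def load(rows):
    return {"loaded": rows}                         # stand-in for a real data sink

def run_pipeline(source, steps):
    """Run the source DPU, then feed each step's output into the next."""
    data = source()
    for step in steps:
        data = step(data)
    return data

result = run_pipeline(extract, [transform, load])
# result == {'loaded': ['Bratislava', 'Košice']}
```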

Data protection legislation

Data protection legislation is not about protecting the data, but about protecting the right of citizens to live without fear that information about their private lives might become public. The law protects privacy (such as information about a person’s economic status, health and political position) and other rights such as the right to freedom of movement and assembly. For example, in Finland a travel card system was used to record all instances when the card was shown to the reader machine on different public transport lines. This raised a debate from the perspective of freedom of movement and the travel card data collection was abandoned based on the data protection legislation.

Data quality

A measure of the usability of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, directly machine-readable (see data cleaning), conformant to standards of nomenclature in the field, and published with sufficient metadata that users can easily understand, for example, who publishes it and the meaning of the variables in the dataset.


Database

A collection of data stored according to a schema and manipulated according to the rules set out in a data modelling facility.

Database rights

A right to prevent others from extracting and reusing content from a database. Exists mainly in European jurisdictions.


Dataset

Any organised collection of data. ‘Dataset’ is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources.

The most basic representation of a dataset is data elements presented in tabular form. Each column represents a particular variable. Each row corresponds to a given value of that column’s variable. A dataset may also present information in a variety of non-tabular formats, such as an extensible mark-up language (XML) file, a geospatial data file, or an image file, etc.

          (i) Any organised collection of data may be considered a database. In this sense the word is synonymous with dataset.

          (ii) A software system for processing and managing data, including features to extend or update, transform and query the data. Examples are the open source PostgreSQL, and the proprietary Microsoft Access.


De-identification

A form of [anonymisation](/glossary/en/terms/anonymisation/) where personal records are kept intact but specific identifying information, such as names, is replaced with anonymous [identifiers](/glossary/en/terms/identifier). Compared to aggregation, de-identification carries a greater risk of [data leakage](/glossary/en/terms/data-leakage/): for example, if prison records include a prisoner's criminal record and medical history, the prisoner could in many cases be identified even without their name by their criminal record, giving unauthorised access to their medical history. In other cases this risk is absent, or the value of the un-aggregated data is so great that it is worth making de-identified data available subject to carefully designed safeguards.
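A minimal sketch of de-identification, replacing names with stable salted-hash identifiers; the field names and salt are illustrative, and, as noted above, indirect identifiers still need separate treatment:

```python
import hashlib

def de_identify(records, secret="replace-with-a-private-salt"):
    """Replace names with stable anonymous identifiers.

    The same name always maps to the same token, so records remain linkable,
    but the name itself is never published. Real de-identification must also
    consider indirect identifiers (criminal record, medical history, ...)
    that could re-identify a person.
    """
    out = []
    for record in records:
        token = hashlib.sha256((secret + record["name"]).encode("utf-8")).hexdigest()[:12]
        cleaned = dict(record)
        cleaned["name"] = token
        out.append(cleaned)
    return out
```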

Developer page

A hub for API discovery which provides a common location listing an organization’s APIs and their associated documentation.


An ordinary table or spreadsheet can easily represent two data dimensions: each data point has a row and a column. Plenty of real-world data has more dimensions, however: for example, a dataset of Earth surface temperature varying with position and time (two co-ordinates are required to specify the position on earth, e.g. latitude and longitude, and one to specify the time).


Discoverability

It is not enough for open data to be published if potential users cannot find it, or do not even know that it exists. Rather than simply publishing data haphazardly on websites, governments and other large data publishers can help make their datasets discoverable by indexing them in catalogues or data portals.


DOI

Digital Object Identifier, an identifier for a digital object (such as a document or dataset) that is assigned by a central registry and is therefore guaranteed to be a globally unique identifier: no two digital objects in the world will have the same DOI.


DPU

See Data Processing Unit.


Endpoint

An association between a binding and a network address, specified by a URI, that may be used to communicate with an instance of a service. An endpoint indicates a specific location for accessing a service using a specific protocol and data format.

Error Response Code

Errors are an inevitable part of API integration, and providing not only a robust set of clear and meaningful API error response codes, but also a clear listing of these codes for developers to follow and learn from, is essential.

API errors are directly related to frustration during developer integration; the friendlier and more meaningful they are, the greater the chance a developer will move forward after encountering an error. Put a lot of consideration into your error responses and the documentation that educates developers.


ETL

Extract - Transform - Load: the approach to data processing essential for working with datasets; in ODN it is provided by UnifiedViews.

EU PSI Directive

The Directive on the re-use of public sector information, 2003/98/EC, “deals with the way public sector bodies should enhance re-use of their information resources.” See Legislative Actions - PSI Directive.

File format

The description of how a file is represented on a computer disk. The format usually corresponds to the last part of the file name (‘extension’), e.g. a file in CSV format might be called schools-list.csv. The file format refers to the internal format of the file, not how it is displayed to users. E.g. CSV and XLS files are structured very differently on disk, but may look similar or identical when opened in a spreadsheet program such as Excel.

Five stars of open data

A rating system for open data proposed by Tim Berners-Lee, founder of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.

See also 1*-5* schema for data formats.

Freedom of Information

Also known as FOI. A requirement in law (e.g. the Freedom of Information Act 2000 in the UK or the Right to Information Act 2005 in India) for public bodies to provide data held by them to citizens on request, unless a specific exemption applies, e.g. the data is confidential. The fact that information must be supplied under FoI laws does not in general make it open data, as it is not distributed, may not be available under an open licence, etc.


Geodata

Any dataset where data points include a location, e.g. as latitude and longitude or another standard encoding. Maps, transport routes, environmental data, cadastral data, and many other kinds of data can be published as geodata.


GeoJSON

A dialect of JSON with specialised features for describing geodata, and hence a popular interchange format for geodata.


GIS

Geographical Information System, any computer system designed to read, display, analyse and manipulate geodata.


GitHub

GitHub is a social coding platform allowing developers to publicly or privately build code repositories and interact with other developers around these repositories, providing the ability to download or fork a repository, as well as contribute back, resulting in a collaborative environment for software development.

"Good" CSV

CSV that uses only commas as separator characters.

Government data 

The work of government involves collecting huge amounts of data, much of which is not confidential (economic data, demographic data, spending data, crime data, transport data, etc). The value of much of this data can be greatly enhanced by releasing it as open data, freeing it for re-use by business, research, civil society, data journalists, etc.


GPS

The Global Positioning System, a satellite-based system which provides exact location information to any equipment with a suitable receiver (including modern smartphones). GPS is invaluable to many location-based apps, providing users with e.g. route-finding information or weather forecasts based on their current location. GPS is also a striking example of successful open data, as it is maintained by the US government and provided free of charge to anyone with a GPS receiver.


Host

A company that stores a customer's data on its own (the host's) computers and makes it available over the [internet](/glossary/en/terms/internet/). A hosted service is one that runs and stores data on the service-provider's computers and is accessed over the network. See also SaaS.

Human Readable

Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.


Identifier

The name of an object or concept in a database. An identifier may be the object’s actual name (e.g. ‘London’ or ‘W1 1AA’, a London postcode), or a word describing the concept (‘population’), or an arbitrary identifier such as ‘XY123’ that makes sense only in the context of the particular dataset. Careful choice of identifiers using relevant standards can facilitate data integration. See linked data.


Information

Information, as defined in OMB Circular A-130, means any communication or representation of knowledge such as facts, data, or opinions in any medium or form, including textual, numerical, graphic, cartographic, narrative, or audiovisual forms.

It is a structured collection of data presented in a form that people can understand and process. Information is converted into knowledge when it is contextualised with the rest of a person’s knowledge and world model.

Information Asset Register

IARs are registers specifically set up to capture and organise metadata about the vast quantities of information held by government departments and agencies. A comprehensive IAR includes databases, old sets of files, recent electronic files, collections of statistics, research and so forth.

It is essential that the metadata in the IARs should be comprehensive so that search engines can function effectively. In the spirit of open government data, public bodies should make their IARs available to the general public as raw data under an open license so that civic hackers can make use of the data, for example by building search engines and user interfaces.

Information Life Cycle

Information life cycle, as defined in OMB Circular A-130, means the stages through which information passes, typically characterized as creation or collection, processing, dissemination, use, storage, and disposition.

Information System

Information system, as defined in OMB Circular A-130, means a discrete set of information resources organized for the collection, processing, maintenance, transmission, and dissemination of information, in accordance with defined procedures, whether automated or manual.

Information System Life Cycle

Information system life cycle, as defined in OMB Circular A-130, means the phases through which an information system passes, typically characterized as initiation, development, operation, and termination.

Intellectual property rights

Monopolies granted to individuals for intellectual creations.



IP rights

See Intellectual property rights.


JSON

JSON (JavaScript Object Notation) is a simple but powerful lightweight data-interchange format. It can describe complex data structures, is highly machine-readable as well as reasonably human-readable, and is independent of platform and programming language, and is therefore a popular format for data interchange between programs and systems.

It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions familiar to programmers of the C family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
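A minimal sketch of serialising and parsing JSON with Python's standard `json` module (the dataset shown is illustrative):

```python
import json

dataset = {
    "title": "Air quality measurements",
    "stations": [
        {"id": "BA-01", "pm10": 21.5},
        {"id": "KE-02", "pm10": None},  # JSON 'null' marks a missing reading
    ],
}

text = json.dumps(dataset, indent=2)  # serialise to a JSON string
assert json.loads(text) == dataset    # parse it back: the round trip is lossless
```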


JSONP

JSONP, or “JSON with padding”, is a JSON extension wherein the name of a callback function is specified as an input argument of the underlying JSON call itself. JSONP makes use of runtime script tag injection.


KML

Keyhole Markup Language, an XML-based open format for geodata. KML was devised for Keyhole Earth Viewer, later acquired by Google and renamed Google Earth, but has been an international standard of the Open Geospatial Consortium since 2008.


Knowledge

The sum of a person’s - or mankind’s - information about and ability to understand the world. See also data.


Licence

A legal instrument by which a copyright holder may grant rights over the protected work. Data and content is open if it is subject to an explicitly-applied licence that conforms to the Open Definition. A range of standard open licences are available, such as the Creative Commons CC-BY licence, which requires only attribution.

Licence mixing

If Project X publishes content, and wants to include content from Project Y, it is necessary that Y’s licence permits at least the same range of re-uses as X’s licence. For example, content published under a non-commercial licence cannot be included in Wikipedia, since Wikipedia’s open licence includes rights for commercial re-use which cannot be granted for the non-commercial data, an example of a failure of licences to mix well.

Linked data

A form of data representation where every identifier is an http://… URI, using standard lists (see vocabulary) of identifiers where possible, and where datasets include links to reference datasets of the same objects. A key aim is to make data integration automatic, even for large datasets. Linked data is usually represented using RDF. See also five stars of open data; triple store.
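The triple-based model behind linked data can be sketched with plain tuples; real systems use an RDF triple store queried with SPARQL, and the URIs below (apart from the standard `owl#sameAs`) are illustrative:

```python
# Each fact is a (subject, predicate, object) triple; URIs identify the things.
triples = {
    ("http://example.org/city/Bratislava",
     "http://example.org/prop/population", "425000"),
    ("http://example.org/city/Bratislava",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Bratislava"),
}

def match(subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]
```

The `owl#sameAs` link to DBpedia is the fifth star at work: it connects this dataset to someone else's data about the same city.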


Linkset

A number of links can be grouped into a linkset that connects to an adjacent point.

Machine readable

Data in a format that can be automatically and easily processed by a computer without human intervention and without loss of semantic meaning. Such formats, read and processed directly by a computer, include CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable.

Note: The appropriate machine readable format may vary by type of data - so, for example, machine readable formats for geographic data may differ from those for tabular data.

Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable.

Many eyes principle

If something is visible to many people then, collectively, they are more likely to find errors in it. Publishing open data can therefore be a way to improve its accuracy and data quality, especially where a good interface for reporting errors is provided. See crowdsourcing.


Metadata

Information about a dataset, such as its title and description, method of collection, author or publisher, area and time period covered, licence, date and frequency of release, etc. To facilitate common understanding, these characteristics, or attributes, of the data are defined explicitly. It is essential to publish data with adequate metadata to aid both the discoverability and the usability of the data.

These characteristics of data are known as “metadata”, that is, “data that describes data.” For any particular datum, the metadata may describe how the datum is represented, ranges of acceptable values, its relationship to other data, and how it should be labeled. Metadata also may provide other relevant information, such as the responsible steward, associated laws and regulations, and access management policy. Each of the types of data described above has a corresponding set of metadata. Two of the many metadata standards are the Dublin Core Metadata Initiative (DCMI) and Department of Defense Discovery Metadata Standard (DDMS).

The metadata for structured data objects describes the structure, data elements, interrelationships, and other characteristics of information, including its creation, disposition, access and handling controls, formats, content, and context, as well as related audit trails. Metadata includes data element names (such as Organization Name, Address, etc.), their definition, and their format (numeric, date, text, etc.). In contrast, data is the actual data values such as the “US Patent and Trade Office” or the “Social Security Administration” for the metadata called “Organization Name”. Metadata may include metrics about an organization’s data including its data quality (accuracy, completeness, etc.).


MidPoint

MidPoint is an Identity Management tool that lets the user log in only once when working with the whole ODN system and all its components, which would otherwise each request a separate account and credentials. MidPoint provides the Single Sign-On (SSO) feature of ODN.


NGO

Non-governmental organisation. NGOs are voluntary, non-profit organisations focussing on charitable work, community-building, campaigning, research, etc., making up a vital part of civil society.

Non-commercial

A restriction, as part of a licence, that content cannot be freely re-used for 'commercial' purposes. Content or data subject to a non-commercial restriction is not open, according to the Open Definition. Such a restriction reduces economic value and causes problems with licence mixing, as well as often ruling out more than is intended (for example, it is often unclear whether educational uses are 'commercial'). The intent of a non-commercial clause may be better captured by a share-alike requirement.


OAuth

An open standard for authorization. It allows users to share their private resources stored on one site with another site without having to hand out their credentials, typically a username and password.


ODbL

Open Database Licence, an attempt to create an open licence for data which covers the 'database right' (see copyright) as well as copyright itself. It does this by imposing contractual obligations on the data re-user. Unfortunately contract law is fundamentally different from copyright law, since copyright is inherent in a work and binds all downstream users of the work, whereas a contract only binds the parties to the contract and has no force on a later re-user of re-published data. The ODbL remains useful nevertheless, and other attempts are being made to create open licences specifically for data.


ODRA

Open Data Readiness Assessment, a framework created by the World Bank for assessing the opportunities, obstacles and next steps to be taken in a country (especially a developing country) considering publishing government data as open data. See development data.


OGP

The Open Government Partnership, a partnership of national governments launched in 2011 with the aim of promoting open government in the member countries and collaborating on multi-lateral agreements and best practice. At the time of writing (2014) there are 64 participating countries.

Open Access

The principle that access to the published papers and other results of research, especially publicly-funded research, should be freely available to all. This contrasts with the traditional model where research is published in journals which charge subscription fees to readers. Besides benefits similar to the benefits of open data, proponents suggest that it is immoral to withhold potentially life-saving and valuable research from some readers who may be able to use or build on it. Open-access journals now exist and the interest of research funders is giving them some traction, especially in the sciences.

Open Data

Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose - subject only, at most, to requirements to provide attribution and/or share-alike.

Open Data is any data published on the Internet that is:

  1. complete,
  2. easily accessible,
  3. machine-readable,
  4. using commonly owned (open) standards,
  5. available under explicitly stated terms of use (a licence) that allow its reuse with minimal restrictions,
  6. available to the potential users for minimal possible costs.

Open Data should also be:

  1. primary,
  2. timely,
  3. non-discriminating,
  4. permanent.

Specifically, open data is defined by the Open Definition and requires that the data be

A. Legally open: that is, available under an open (data) license that permits anyone freely to access, reuse and redistribute

B. Technically open: that is, that the data be available for no more than the cost of reproduction and in machine-readable and bulk form.

Adapted from:

Open definition

The Open Definition, first released by Open Knowledge in 2005, sets out under what conditions data and content count as open. The "standard" provided by the Open Definition is crucial because much of the value of open data lies in the ease with which different sources of open data can be combined. Both legal and technical compatibility are vital, and the Open Definition ensures that openly-licensed data can be combined successfully, avoiding a proliferation of licences and terms of use that would lead to complexity and incompatibility. As governments and organisations jostle to wear the 'open' label, the Open Definition ensures that the term does not lose its meaning amid the hype. Today it is the main international standard for open data and open data licences, overseen by an advisory council of senior open data practitioners. The expert-governed licence conformance process and recommendations for conformance have strengthened licences around the world, for example in the revision of the UK Government's internationally influential "Open Government Licence". The Open Definition has also influenced and steered other communities of practice in the open movement, including open access to publicly-funded research, open hardware, and more. See open data for a summary.

Open Development

Open development seeks to bring the philosophy of the open movement to international development. It promotes open government, transparency of aid flows, engagement of beneficiaries in the design and implementation of development projects, and availability and use of open development data.

Open format

A file format with no restrictions, monetary or otherwise, placed upon its use, and which can be fully processed with at least one free/libre/open-source software tool. Patents are a common source of restrictions that make a format proprietary. Often, but not necessarily, the structure of an open format is set out in agreed standards, overseen and published by a non-commercial expert body. A file in an open format enjoys the guarantee that it can be correctly read by a range of different software programs or used to pass information between them. Compare proprietary.

Open government

Open government, in line with the open movement generally, seeks to make the workings of governments transparent, accountable, and responsive to citizens. It includes the ideals of democracy, due process, citizen participation and open government data. A thorough-going approach to open government would also seek to enable citizen participation in, for example, the drafting and revising of legislation and budget-setting. See OGP.

Openlink Virtuoso

The only RDBMS (tabular data storage) implementation used by UnifiedViews.

Open movement

The open movement seeks to work towards solutions of many of the world’s most pressing problems in a spirit of transparency, collaboration, re-use and free access. It encompasses open data, open government, open development, open science and much more. Participatory processes, sharing of knowledge and outputs and open source software are among its key tools. The specific definition of “open” as applied to data, knowledge and content, is set out by the Open Definition.

Open Science

The practice of science in accordance with open principles, including open access publishing, publication of and collaboration around research data as open data together with associated source code, and use and development of open source data processing tools.

Open Source

Software for which the source code is available under an open licence. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source.

Open Source Software

Computer software that is available in source code form: the source code and certain other rights normally reserved for copyright holders are provided under an open-source license that permits users to study, change, improve and at times also to distribute the software.

Open source software is very often developed in a public, collaborative manner. Open source software is the most prominent example of open source development and often compared to (technically defined) user-generated content or (legally defined) open content movements.

Open standards

A standard developed or adopted by voluntary consensus standards bodies, both domestic and international. These standards include provisions requiring that owners of relevant intellectual property have agreed to make that intellectual property available on a non-discriminatory, royalty-free or reasonable royalty basis to all interested parties.

Can also be interpreted to mean standards which are developed in a vendor-neutral manner.

Openness of the data

Determines the target level of openness according to the 1*-5* schema for each dataset. The minimal level is 3*.

It is possible to choose the 2* level in special cases when the dataset exists only in the form of unstructured documents and it is not possible for the organization to convert them to a structured form.

Choose one or more data formats for publishing each dataset based on the determined level of openness:

  • For the 2* level, we recommend tabular editor formats (ODS, XLS(X), etc.) or HTML. For textual documents (e.g. public contract agreements), it is possible to choose text editor formats (ODT, DOC(X), etc.).
  • For the 3* level, we recommend CSV, XML or JSON.
  • For the 4* level, it is possible to choose RDF, which is also the recommended format for the 5* level. It is also possible to choose CSV, XML or JSON. The key characteristic of the 4* level compared to the 3* level is that entities in the dataset are identified by URLs, so that they can be transparently referenced from other datasets. P02A03-04 and P02A03-05 provide recommendations for designing the URLs. We recommend recording the URLs in CSV, XML and JSON formats in the following way:
    • For CSV, we recommend adding a new column for entity URLs, placed alongside the existing column for entity identifiers.
    • For XML, we recommend recording URLs using RDFa (the resource attribute), an extension of HTML and XML documents.
    • For JSON, we recommend recording URLs using JSON-LD (the @id construct), an extension of JSON.
  • For the 5* level, i.e. linked open data represented in the RDF model, we recommend TTL. The key characteristic of the 5* level compared to the 4* level is that we provide not only entity identifiers in the form of URLs but also links to the URLs of related entities in other datasets. It is also necessary to ensure that each URL of our published entities is dereferenceable, i.e. that a client application receives a machine-readable representation of the entity in the RDF model when it accesses the URL.
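A sketch of how an entity URL can be carried in each recommended format: an extra URL column alongside the identifier column in CSV, and the @id construct in JSON-LD. The entity, URL and column names below are invented for illustration:

```python
import csv
import io
import json

# A hypothetical entity with a made-up URL identifier.
entity = {"id": "ORG-001", "name": "Example Office"}
url = "http://data.example.org/organization/ORG-001"

# CSV: add a URL column alongside the existing identifier column.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "url", "name"])
writer.writerow([entity["id"], url, entity["name"]])

# JSON-LD: the @id construct holds the entity URL.
record = {"@id": url, "name": entity["name"]}

print(buf.getvalue().splitlines()[1])
print(json.dumps(record))
```

Other datasets can then reference this entity simply by reusing the same URL.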


Organization

CKAN Organizations are used to create, manage and publish collections of datasets. Users can have different roles within an Organization, depending on their level of authorisation to create, edit and publish.


Parameter

A special kind of variable, used in a subroutine to refer to one of the pieces of data provided as input to the subroutine. The semantics of how parameters are declared and how arguments are passed to the parameters of subroutines are defined by the language, but the details of how this is represented in any particular computer system depend on the calling conventions of that system.
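For instance, in Python a subroutine declares parameters and the caller supplies arguments that are bound to them (the function name and timetable here are invented):

```python
def next_departure(stop, after):
    """'stop' and 'after' are parameters; the values the caller
    passes below are the arguments bound to them."""
    timetable = {"Main Square": ["09:10", "09:40", "10:10"]}
    return next(t for t in timetable[stop] if t > after)

# "Main Square" and "09:15" are the arguments.
print(next_departure("Main Square", "09:15"))  # 09:40
```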


PDF

Portable Document Format, a file format for representing the layout and appearance of documents on a page independent of the layout software, computer operating system, etc. Originally a proprietary format of Adobe Systems, PDF has been an open format since 2008. Data in PDF files is not machine-readable; see structured data.


Pipeline

A series of DPUs (Data Processing Units) in UnifiedViews. The objective of a pipeline is to execute a given ETL task (extract - transform - load). There are some ready-made pipelines, but an Admin or Publisher can build new pipelines by selecting DPUs from the list on the UnifiedViews canvas.
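The extract-transform-load idea behind a pipeline can be sketched as a chain of small steps; the step functions and data below are stand-ins invented for illustration, not actual DPUs:

```python
# Hypothetical stand-ins for DPUs: each step takes data and returns data.
def extract():
    return ["  Bratislava ", "kosice", ""]

def transform(rows):
    # Clean up: drop empty rows, trim whitespace, normalise capitalisation.
    return [r.strip().title() for r in rows if r.strip()]

def load(rows):
    return {"loaded": rows}

# The pipeline executes the steps in sequence, like an ETL task.
result = load(transform(extract()))
print(result)  # {'loaded': ['Bratislava', 'Kosice']}
```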


Privacy

The right of individuals to a private life includes a right not to have personal information about themselves made public. A right to privacy is recognised by the Universal Declaration of Human Rights and the European Convention on Human Rights. See data protection.


Proprietary

  1. Proprietary software is owned by a company which restricts the ways in which it can be used. Users normally need to pay to use the software, cannot read or modify the source code, and cannot copy the software or re-sell it as part of their own product. Common examples include Microsoft Excel and Adobe Acrobat. Non-proprietary software is usually open source.

  2. A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished, and can be changed by the company at any time. Proprietary software usually reads and saves data in its own proprietary format. For example, different versions of Microsoft Excel use the proprietary XLS and XLSX formats.

Public domain

Content to which copyright does not apply, for example because it has expired, is free for any kind of use by anyone and is said to be in the public domain. CC0, one of the licences of Creative Commons, is a ‘public domain dedication’ which attempts so far as possible to renounce all rights in the work and place it in the public domain.


Publisher

Anyone who distributes and makes available data or other content. Data publishers include government departments and agencies, research establishments, NGOs, media organisations, commercial companies, individuals, etc.


Query

A type of question accepted by a database about the data it holds. A complex query may ask the database to select records according to some criteria, aggregate certain quantities across those records, etc. Many databases accept queries in the specialised language SQL or dialects of it. A web API allows an app to send queries to a database over the web. Compared with downloading and processing the data, this reduces both the computation load on the app and the bandwidth needed.
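A minimal sketch of a query that selects records by a criterion and aggregates over them, using Python's built-in sqlite3 module and an invented timetable table:

```python
import sqlite3

# In-memory database with a tiny hypothetical departures table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departures (stop TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO departures VALUES (?, ?)",
    [("Main Square", "09:10"), ("Main Square", "09:40"), ("Station", "09:20")],
)

# Select records matching a criterion and aggregate across them.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM departures WHERE stop = ?", ("Main Square",)
).fetchone()
print(count)  # 2
```

A web API serving the same data would accept an equivalent question over HTTP and return only the matching records.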

Raw data

The original data, in machine-readable form, underlying any application, visualisation, published research or interpretation, etc.

RDBMS data mart

A tabular data store, where data is stored when the data publisher wants to prepare CSV dumps of the published dataset or provide a REST API for data consumers.

ODN/Storage will use an SQL relational database (such as MySQL, PostgreSQL, etc.) to store tabular data.

Every transformation pipeline can contain one or more tabular data mart loaders - DPUs which load the data resulting from the transformation pipeline into the RDBMS data mart. Every loader loads data into a single table. The table name is prepared by ODN/UnifiedViews and is based on the dataset ID and the ID of the tabular data mart loader DPU.

Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by the ODN/Publication module, the data publisher may decide which tables should be published by (1) manually specifying all the tables to be published, or (2) specifying that all results of a certain transformation pipeline should be published.

To support this feature, data stored in the RDBMS data mart must be associated with metadata holding, for every table, at least:

  • which dataset the table belongs to

  • which transformation pipeline produced the table


RDF

Resource Description Framework, the native way of describing linked data, a family of specifications for a metadata model. RDF is not exactly a data format; rather, there are a few equivalent formats in which RDF can be expressed, including an XML-based format. RDF data takes the form of ‘triples’ (each atomic piece of data has three parts, namely a subject, predicate and object), and can be stored in a specialised database called a triple store.

The RDF family of specifications is maintained by the World Wide Web Consortium (W3C). The RDF metadata model is based upon the idea of making statements about resources in the form of a subject-predicate-object expression…and is a major component in what is proposed by the W3C’s Semantic Web activity: an evolutionary stage of the World Wide Web in which automated software can store, exchange, and utilize metadata about the vast resources of the Web, in turn enabling users to deal with those resources with greater efficiency and certainty. RDF’s simple data model and ability to model disparate, abstract concepts has also led to its increasing use in knowledge management applications unrelated to Semantic Web activity.
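The triple structure can be pictured with ordinary (subject, predicate, object) tuples; the URIs and prefixes below are invented for illustration:

```python
# Each atomic statement has a subject, a predicate and an object.
triples = [
    ("http://example.org/Paris", "rdf:type", "http://example.org/City"),
    ("http://example.org/Paris", "rdfs:label", "Paris"),
    ("http://example.org/France", "ex:capital", "http://example.org/Paris"),
]

# Everything stated about one subject:
about_paris = [(p, o) for s, p, o in triples if s == "http://example.org/Paris"]
print(len(about_paris))  # 2
```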

RDF data mart

In ODN, data is stored in the RDF data mart when the data publisher wants to prepare RDF dumps of the published dataset for data consumers or to provide a SPARQL endpoint on top of the published dataset.

Every transformation pipeline can contain one or more RDF data mart loaders - DPUs which load the data resulting from the transformation pipeline into the RDF data mart. Every RDF data mart loader loads data into a single RDF graph. An RDF graph represents a context for RDF triples: a graph is the collection of RDF triples produced by one RDF data mart loader. The graph name is prepared by ODN/UnifiedViews and is based on the dataset ID and the ID of the RDF data mart loader DPU.

Since every published dataset may require more than one transformation pipeline, and not all results of every transformation pipeline should be published by the ODN/Publication module, the data publisher may decide which RDF graphs should be published by (1) manually specifying all the graphs to be published, or (2) specifying that the results of a certain transformation pipeline should be published.

To support this feature, data stored in the RDF data mart must be associated with metadata holding, for every RDF graph, at least:

  • which dataset the graph belongs to

  • which transformation pipeline produced the graph.

Every transformation pipeline (ODN/UnifiedViews) can contain one or more RDF/RDBMS data mart loaders - DPUs, which load data resulting from the transformation pipeline to the corresponding data mart (RDF/RDBMS).

ODN/Publication uses the data marts to get the graphs/tables to be published (exported as RDF/CSV dumps, or made available via a REST API/SPARQL endpoint). ODN/Publication selects the relevant graphs/tables based on the data publisher's preference and the metadata associated with the tables/graphs.

Real time

Data (such as the current location of trains on a network) which is constantly updated, and where a query must run against the latest version of the data.

Research data

Experimental research in the sciences and social sciences produces large quantities of data. Research data management (RDM) is an emerging discipline that seeks best practices in handling this. Traditionally the data was kept by researchers and only final research outputs, such as papers analysing the data, would be published. Open science holds that the data should be published, both to increase verifiability of the work and to enable it to be used in other research. The full spirit of open science collaboration demands data publication early in the project, but research culture will need to change appreciably before this becomes widespread.


Resource

CKAN uses this term to denote one of the individual data objects in a dataset (a file such as a spreadsheet, or an API).


REST

A style of software architecture for distributed systems such as the World Wide Web. REST has emerged as a predominant web service design model. It facilitates interaction between web servers by allowing loose coupling between different services. REST is less strongly typed than its counterpart, SOAP, is based on the use of nouns and verbs, and emphasises readability. Unlike SOAP, REST does not require XML parsing or a message header to and from a service provider, and so ultimately uses less bandwidth.


Re-use

It is rare that data gathered for a particular purpose does not have other possible uses. Happily, data is an infinite resource (see tragedy of the anti-commons); once gathered, for whatever reason, it can be re-used again and again, in ways that were never envisaged when it was collected, provided only that the data-holder makes it available under an open licence to enable such re-use.


RSS

A family of web feed formats (often dubbed Really Simple Syndication) used to publish frequently updated works — such as blog entries, news headlines, audio, and video — in a standardized format. An RSS document (which is called a “feed,” “web feed,” or “channel”) includes full or summarized text, plus metadata such as publishing dates and authorship.
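Since RSS is plain XML, a feed can be processed with a standard XML parser; the feed below is a minimal hypothetical example, parsed with Python's xml.etree:

```python
import xml.etree.ElementTree as ET

# A minimal hypothetical RSS 2.0 feed.
feed = """<rss version="2.0"><channel>
  <title>Open Data News</title>
  <item><title>New dataset released</title></item>
  <item><title>Portal updated</title></item>
</channel></rss>"""

root = ET.fromstring(feed)
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)  # ['New dataset released', 'Portal updated']
```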


SaaS

Software as a Service, i.e. a software program that runs not on the user's machine but on the machines of a hosting company, which the user accesses over the web. The host takes care of the associated data storage, and normally charges for the use of the service or monetises its client base in other ways.


Schema

A description of the structure of data. An XML schema defines the structure of an XML document: which data elements and attributes can appear in a document; how the data elements relate to one another; whether an element is empty or can include text; which types of data are allowed for specific data elements and attributes; and what the default and fixed values for elements and attributes are. It is a method of specifying constraints on XML documents. A schema is also a description of the data represented within a database; the format of the description varies, but includes the table layout for a relational database or an entity-relationship diagram.


Screen scraping

Extracting data from a non-machine-readable source, such as a website or a PDF document, and creating structured data from the result. Screen-scraping a dataset requires dedicated programming and is expensive in programmer time, so is generally done only after all other attempts to get the data in structured form have failed. Legal questions may arise about whether the scraping breaches the source website’s copyright or terms of service.
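As a small sketch of the idea, this scraper pulls table cells out of an HTML page using Python's standard html.parser; the page snippet and class name are invented:

```python
from html.parser import HTMLParser

class CellScraper(HTMLParser):
    """Collects the text of <td> cells -- structured data recovered
    from a page intended only for human reading."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

scraper = CellScraper()
scraper.feed("<table><tr><td>Bratislava</td><td>432000</td></tr></table>")
print(scraper.cells)  # ['Bratislava', '432000']
```

Real scrapers are far more fragile: any change to the page layout can break the extraction logic.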


SDK

Software Development Kits (SDKs) are the next step in providing code for developers, after basic code samples. SDKs are more complete code libraries that usually include authentication and production-ready objects, which developers can use once they are more familiar with an API and are ready for integration.

Just as with code samples, SDKs should be provided in as many common programming languages as possible. Code samples help developers understand an API, while SDKs actually facilitate the integration of an API into their applications. When providing SDKs, consider software licensing that gives your developers as much flexibility as possible in their commercial products.


Server

A computer on the internet, usually managed by a hosting company, that responds to requests from a user, e.g. for web pages or downloaded files, or to access features in a SaaS package being run on the server.


Service-oriented architecture (SOA)

Expresses a software architectural concept that defines the use of services to support the requirements of software users. In a SOA environment, nodes on a network make resources available to other participants in the network as independent services that the participants access in a standardized way. Most definitions of SOA identify the use of Web services (using SOAP and WSDL) in its implementation. However, one can implement SOA using any service-based technology with loose coupling among interacting software agents.


Shapefile

A popular file format for geodata, maintained and published by Esri, a manufacturer of GIS software. A Shapefile actually consists of several related files. Though the format is technically proprietary, Esri publishes a full specification standard and Shapefiles can be read by a wide range of software, so the format functions somewhat like an open standard in practice.

Share-alike License

A license that requires users of a work to provide the content under the same or similar conditions as the original.


SOAP

SOAP (Simple Object Access Protocol) is a message-based protocol based on XML for accessing services on the Web. It employs XML syntax to send text commands across the Internet using HTTP. SOAP is similar in purpose to the DCOM and CORBA distributed object systems, but is more lightweight and less programming-intensive. Because of its simple exchange mechanism, SOAP can also be used to implement a messaging system.

Source code 

The files of computer code written by programmers that are used to produce a piece of software. The source code is usually converted or ‘compiled’ into a form that the user’s computer can execute. The user therefore never sees the original source code, unless it is published as open source.


SPARQL

A query language similar to SQL, used for queries to a linked-data triple store.


Spreadsheet

A table of data and calculations that can be processed interactively with a specialised spreadsheet program such as Microsoft Excel or OpenOffice Calc.


SQL

Structured Query Language, a standard language used for interrogating many types of database. See query.


SSO

Single Sign-On, a procedure allowing a user to enter credentials just once for a whole set of modules or sub-accounts that would otherwise each request a separate log-in.


Standard

A published specification for, e.g., the structure of a particular file format, recommended nomenclature to use in a particular domain, a common set of metadata fields, etc. Conforming to relevant standards greatly increases the value of published data by improving machine readability and easing data integration.

Storage in ODN

Modular constituent of ODN. After transformation, the data produced by ODN/UnifiedViews is stored in the Storage module. The ODN/Publication module then uses ODN/Storage to get the transformed data so that it can be published, i.e. provided to data consumers.

Two important components of ODN/Storage are RDBMS data mart and RDF data mart.


ODN/Storage notifies the ODN/Publication module about changes that have happened (dataset updates, etc.) so that ODN/Publication can adapt to them.

ODN/Management may query ODN/Storage to get statistics about the stored data.

Structured data 

All data has some structure, but ‘structured data’ refers to data where the structural relation between elements is explicit in the way the data is stored on a computer disk. XML and JSON are common formats that allow many types of structure to be represented. The internal representation of, for example, word-processing documents or PDF documents reflects the positioning of entities on the page, not their logical structure, which is correspondingly difficult or impossible to extract automatically.


Swagger

A specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The overarching goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters and models is tightly integrated into the server code, allowing APIs to always stay in sync.

Tab-separated values 

Tab-separated values (TSV) is a very common text file format for sharing tabular data. The format is extremely simple and highly machine-readable.
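The format is so close to CSV that, in Python for example, the same csv module reads it once the delimiter is switched to a tab (the data below is invented):

```python
import csv
import io

tsv_text = "city\tpopulation\nBratislava\t432000\n"

# The csv module handles TSV by changing the delimiter to a tab.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[1])  # ['Bratislava', '432000']
```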

Terms of Service

Terms of Service provide a legal framework for developers to operate within. They set the stage for the business development relationships that will occur within an API ecosystem. Terms of Service should protect the API owner’s company, assets and brand, but should also provide assurances for developers who are building businesses on top of an API.

Tragedy of the anti-commons 

The well-known tragedy of the commons occurs when a common resource, such as grazing land, is degraded through over-use. Effectively, users are treating a limited resource as if it were limitless, owing to a poor incentive structure. The economist Michael Heller coined the term ‘tragedy of the anti-commons’ to describe the opposite failure, where poor incentives lead to under-use of an abundant or limitless resource. The case of data which is unpublished or charged for at above marginal cost is a prime example, data being in fact a limitless resource.


Transparency

Governments and other organisations are said to be transparent when their workings and decision-making processes are well-understood, properly documented and open to scrutiny. Transparency is one of the aspects of open government. An increase in transparency is one of the benefits of open data.

Transport data 

Public transport routes, timetables and real time data are valuable but difficult candidates for open data. Even when they are published, data from different transit authorities and companies may not be available in compatible formats, making it difficult for third parties to provide integrated transport information. Many transport authorities distribute public transport data using the General Transit Feed Specification (GTFS) which is maintained by Google. Work on standardisation and more open data is ongoing in the sector.

Triple store 

The ‘triples’ of RDF data can be stored in a specialised database, called a triple store, against which queries can be made in the query language SPARQL.
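The kind of pattern matching a SPARQL query performs against a triple store can be imitated in a few lines of Python; here None plays the role of a query variable, and the prefixed URIs are invented:

```python
triples = {
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Paris", "rdfs:label", "Paris"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None behaves like a SPARQL variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Analogous to: SELECT ?s ?o WHERE { ?s ex:capitalOf ?o }
print(match(p="ex:capitalOf"))
```

A real triple store applies the same idea at scale, with indexes over all three positions of the triple.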


TSV

See Tab-separated values.


Unconference

A meeting, similar to a conference, but with no agenda fixed in advance. Using various established techniques, participants jointly agree on the day what sessions will run. Some more traditional conference sessions with invited speakers may also be included. A popular format among the tech community, an unconference can be combined with or run alongside a hackathon based on open data. It is a possible method of community engagement by data publishers.


UnifiedViews

An Extract-Transform-Load (ETL) framework that allows users – publishers, consumers, or analysts – to define, execute, monitor, debug, schedule, and share RDF data processing tasks. UnifiedViews is one of the core components of Open Data Node, a publication platform for open data. It works with pipelines built from DPUs (Data Processing Units).

Unique identifier 

(or UID): An identifier for an object which is guaranteed to be different from identifiers of all other objects in a collection. Within a database, every object will have a UID that is unique within the database. A UID assigned by a central registry (such as an ISBN for books, or a DOI for data) will be unique for all objects for which it is assigned. The http://… identifiers of linked data provide a technique for guaranteeing UIDs without a central authority.
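As a small illustration of decentralised unique identifiers, Python's standard `uuid` module generates identifiers whose uniqueness rests on randomness rather than a central registry such as the ones behind ISBNs or DOIs:

```python
import uuid

# uuid4() draws 122 random bits, so collisions are vanishingly unlikely
# even though no central authority hands out the identifiers.
a = uuid.uuid4()
b = uuid.uuid4()

# UUIDs have a fixed textual form, e.g. "xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx".
a_text = str(a)
```

Linked data takes the other route mentioned above: http://… identifiers are made unique by anchoring them in domain names, which are themselves centrally registered.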

Unstructured Data

Data that is more free-form, such as multimedia files, images, sound files, or unstructured text. Unstructured data does not necessarily follow any format, hierarchical sequence, or relational rules; the term refers to masses of (usually) computerised information that lack a data structure easily readable by a machine.

Examples of unstructured data include audio, video and unstructured text such as the body of an email or word-processor document. Data mining techniques are used to find patterns in, or otherwise interpret, this information. It is often estimated that around 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centres and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, and web pages.
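A tiny example of the kind of pattern-extraction pass that data mining applies to free-form text – here pulling e-mail addresses out of a sentence with a deliberately simplified regular expression (the sample text and the pattern are both illustrative, not production-grade):

```python
import re

# Free-form text as it might appear in a memo or support note.
text = "Contact alice@example.org or bob@example.com for the dataset."

# A simplified e-mail pattern; real-world address matching is far messier.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
```

The output is structured data (a list of addresses) recovered from unstructured input, which is the essential move behind most text mining.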


URI

Uniform Resource Identifier: a string that names or identifies a resource, whether by name, location, citation, etc. The familiar URL (http://…) and the URN are subsets of URI, and are therefore URIs themselves; all serve to identify a data resource.


URI / URL

Uniform Resource Identifier / Uniform Resource Locator. A URL is the http://... web address of some page or resource. When a URL is used in linked data as the identifier for some object, it is not strictly a locator for the object (e.g. it may be the location of a document about Paris, but not of Paris itself), so in this context it is referred to as a URI.
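The components of a URL-form URI can be inspected with Python's standard `urllib.parse`; the example identifier below is invented (`example.org`), but the decomposition is the same for any http URI, whether it is used as a locator or purely as an identifier:

```python
from urllib.parse import urlparse

# A hypothetical linked-data identifier; the same string could locate a
# document about the thing, or simply name the thing itself.
uri = "http://example.org/id/Paris"
parts = urlparse(uri)

scheme = parts.scheme   # "http"
host = parts.netloc     # "example.org"
path = parts.path       # "/id/Paris"
```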


Visualisation

A visual representation of data is often the most compelling way of communicating the data, bringing out its key features, correlations and outliers. Though many tools exist, creating a visualisation for a dataset is not an automatic process, but requires careful attention to the meaning of the variables, the relations between them and the stories inherent in the data, to design a visual representation that lets the message of the data shine through.


Vocabulary

A standard specifying the identifiers to be used for a particular collection of objects. Using standard vocabularies where they exist is key to enabling data integration. Linked data is rich in vocabularies in different topic areas.


Web

The World Wide Web, the vast collection of interlinked and linkable documents and services accessible via web browsers over the Internet.

Web API 

An API that is designed to work over the Internet: a way for computer programs to talk to one another, defined in terms of the requests a programmer can send from one program to another and the responses that come back.
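Following the bus-stop example given under API above, a client typically encodes its question as parameters on a request URL. A sketch with Python's standard library, against a hypothetical endpoint (the host, path and parameter names are invented):

```python
from urllib.parse import urlencode

# Hypothetical transit API endpoint; not a real service.
base = "https://api.example.org/departures"

# Ask only for the next departure from one stop, rather than
# downloading the whole timetable dataset.
query = urlencode({"stop": "12345", "limit": 1})
url = f"{base}?{query}"
```

An app would then fetch this URL over HTTP and receive just the small slice of data it asked for, which is what saves bandwidth and keeps the data current.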

Web Service

A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.


WSDL

An XML-based language (Web Services Description Language) used to describe the services a business offers and to provide a way for individuals and other businesses to access those services electronically.


XLS(X)

A proprietary spreadsheet format, the native format of the popular Microsoft Excel spreadsheet package. Older versions use .xls files, while more recent ones use the XML-based .xlsx variant.


XML

Extensible Markup Language (XML) is a flexible language for representing structured data – for creating common information formats and sharing both the format and the content of data over the Internet and elsewhere. XML is a formatting language recommended by the World Wide Web Consortium (W3C).
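A minimal example of what such a shared format looks like and how a consumer reads it, using Python's standard `xml.etree.ElementTree`. The element names below are invented purely for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XML document describing a dataset.
doc = "<dataset><title>Bus stops</title><format>XML</format></dataset>"

# Any consumer that knows the agreed element names can extract the content.
root = ET.fromstring(doc)
title = root.find("title").text
fmt = root.find("format").text
```

Because both the structure (the tags) and the content travel together, publisher and consumer only need to agree on the element names, not on any binary file layout.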


adapted from:
