Our new article, “Towards High-Value Datasets determination for data-driven development: a systematic literature review” (Nikiforova A., Rizun N., Ciesielska M., Alexopoulos C., Miletič A.), is now available on arXiv, with supplementary data published on Zenodo, and is waiting for you to read it! Moreover, this is not only my recommendation: The Living Library has included it in its collection. As you may remember from my posts on another paper they recommended, The Living Library seeks to provide actionable knowledge on governance innovation, informing and inspiring policymakers, practitioners, technologists, and researchers working at the intersection of governance, innovation, and technology in a timely, digestible and comprehensive manner, identifying the “signal in the noise” by curating research, best practices, points of view, new tools, and developments.
Open government data (OGD) is seen as a political and socio-economic phenomenon that promises to promote civic engagement and stimulate public sector innovations in various areas of public life. However, to bring the expected benefits, data must be reused and transformed into value-added products or services. This, in turn, sets another precondition for data, which are expected not only to be available and comply with open data principles, but also to be of value, i.e., of interest for reuse by the end-user. This refers to the notion of ‘high-value dataset’ (HVD). HVD are defined as datasets whose re-use is expected to create the most value for society, the economy, and the environment, contributing to the creation of “value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets” (Directive, 2019). HVD was recognized by the European Data Portal as a key trend in the OGD area in 2022, which is not included in the annual Open Data Maturity Report.
There has been some progress in this area in recent years, through initiatives and studies carried out by several organizations and communities. At the European level, probably the most notable progress has been made by the European Commission in the Open Data Directive (originally the Public Sector Information Directive, or PSI Directive), i.e. Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information. According to it, six thematic data categories are considered to be of high value: (1) geospatial, (2) earth observation and environment, (3) meteorological, (4) statistics, (5) companies and company ownership, and (6) mobility. Further, a list of specific HVDs and the arrangements for their publication was developed and made available as “Commission Implementing Regulation (EU) 2023/138 of 21 December 2022 laying down a list of specific high-value datasets and the arrangements for their publication and re-use” (Commission, 2023). It can be seen as seeking greater harmonization and interoperability of public sector data and data sharing across EU countries, with reference to specific datasets, their granularity, key attributes, geographic coverage, and requirements for their re-use. These requirements include the licence (Creative Commons BY 4.0, any equivalent, or a less restrictive open licence), a specific format where appropriate, frequency of updates and timeliness, availability in machine-readable format, and accessibility via API and bulk download. The data are to be supported with metadata describing the data within the scope of the INSPIRE data themes, which shall contain a specific minimum set of required metadata elements, a description of the data structure and semantics, the use of controlled vocabularies and taxonomies (if relevant), etc.
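To make the publication arrangements named above more concrete, here is a minimal sketch of how a portal owner might check a catalogue entry against a few of them. It is only an illustration under stated assumptions: the `DatasetRecord` fields, the licence identifiers, and the list of machine-readable formats are hypothetical and are not taken from the regulation or from our paper, and the regulation itself contains category-specific details that this checklist ignores.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: illustrative licence identifiers. The regulation names
# CC BY 4.0, any equivalent, or a less restrictive open licence.
ACCEPTED_LICENCES = {"CC-BY-4.0", "CC0-1.0"}

# Assumption: an illustrative set of machine-readable formats.
MACHINE_READABLE_FORMATS = {"csv", "json", "xml", "geojson"}


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry; field names are invented for this sketch."""
    title: str
    licence: str
    formats: list[str] = field(default_factory=list)
    has_api: bool = False
    has_bulk_download: bool = False
    has_metadata: bool = False


def unmet_requirements(record: DatasetRecord) -> list[str]:
    """Return the names of the checks that the record fails, in a fixed order."""
    problems = []
    if record.licence not in ACCEPTED_LICENCES:
        problems.append("licence")
    if not any(f.lower() in MACHINE_READABLE_FORMATS for f in record.formats):
        problems.append("machine-readable format")
    if not record.has_api:
        problems.append("API access")
    if not record.has_bulk_download:
        problems.append("bulk download")
    if not record.has_metadata:
        problems.append("metadata")
    return problems


if __name__ == "__main__":
    # Invented example record, not a real dataset.
    record = DatasetRecord(
        title="Example hourly weather observations",
        licence="CC-BY-4.0",
        formats=["CSV"],
        has_api=True,
    )
    print(unmet_requirements(record))
```

A check like this only covers the formal publication arrangements; it says nothing about whether a dataset is actually of high value in a given country, which is the question the rest of this post is about.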
In addition, the Semantic Interoperability Community (SEMIC) regularly hosts webinars on DCAT-AP (Data Catalogue Vocabulary Application Profile) for HVD to discuss with OGD portal owners, OGD publishers and enthusiasts the best approaches to using DCAT-AP to describe HVD and ensure their findability, accessibility, and reusability.
In other words, while progress has been made in this area, an examination of the above documents reveals that these datasets rather form a list of “mandatory” or “open by default” datasets, sometimes also referred to as “base” or “core” datasets, aiming at open data interoperability with a high level of priority and a relatively equal level of value for most countries, which has contributed to the development and promotion of a more mature open data ecosystem and OGD initiative. However, depending on the specifics of a region or country (geographical location, current social, environmental, and economic issues, culture, ethnicity, likelihood of crises and/or catastrophes, (under)developed industries/sectors and market specificities, and development trajectories, i.e., priorities), more datasets can be recognized as having high value within a particular country or region (Utamachant & Anutariya, 2018; Huyer & Blank, 2020; Nikiforova, 2021). For example, meteorological data describing sea level rise can be of great value in the Netherlands, where it has a strong impact on citizens and businesses because more than 1/3 of the country is below sea level; the same data will be less valuable for less affected countries, such as Italy and France (Huyer & Blank, 2020). We believe that additional factors, such as ongoing smart city initiatives, the Sustainable Development Goals, and the current state of countries and cities in relation to their implementation and established priorities, affect this list as well.
We find it important to support the identification of country-specific HVD that, in turn, could increase user interest by transforming data into innovative solutions and services. Although this is recognized by countries, and some local and regional efforts have been made, mostly by governments with little support from the scientific and academic community, these efforts mainly face problems in the form of delays in their development or complete failure, or end up with some set of HVD but little information about how it was actually determined. These ad-hoc attempts remain closed and not reusable, which is contrary to both the general OGD philosophy and the HVD-centric philosophy that is expected to be standardized. Most of them are ex-post or a combination of ex-ante and ex-post, which makes the process of identifying HVD more resource-intensive: the effect is visible only after potentially valuable datasets have been discovered, published, and maintained, and their impact then needs further evaluation, which is itself a resource-consuming task. All in all, there is no standardized approach to assisting chief data officers in identifying HVDs, resulting in a failure to identify and maintain HVDs consistently.
This is the topic we address. As you may know from my blog, it is not our first attempt. My very first activity related to this topic dates back to 2019, when I studied it in the Latvian setting, i.e. a stakeholder-centered determination of High-Value Data sets for Latvia was carried out in response to a call made by the national OGD initiative. Its results were submitted to the holders of Latvia’s open data portal (Ministry of Environmental Protection and Regional Development) and used to prepare external reports submitted to the Publications Office of the European Union. Later, several countries joined my study, namely Poland, Greece, Croatia and Peru, and together with colleagues we conducted several workshops as part of international conferences, on which I posted before here and here.
This time, we conducted a more theoretical study that seeks to establish a rich knowledge base for determining HVD, while the validation of the identified indicators (found as part of this study and derived from government reports) is expected to take place during workshops with open (government) data and/or e-government experts. All in all, we focused on identifying all efforts made in relation to this topic. In other words, the objective was to examine how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, the stakeholders involved, data-related aspects, and frameworks. We did this by conducting a Systematic Literature Review with the following research questions (RQ) defined to achieve the set objective:
- (RQ1) How is the value of open government data perceived / defined? In which contexts has the topic of HVD been investigated by previous research (e.g., research disciplines, countries)? Are local efforts being made at the country level to identify the datasets that provide the most value to stakeholders of the local open data ecosystem?
- (RQ1.1) How are high-value data defined, and does this definition differ from the definition introduced in the PSI / OD Directive?
- (RQ1.2) What datasets are considered to be of higher value in terms of data nature, data type, data format, data dynamism?
- (RQ2) What indicators are used to determine high-value datasets? How can these indicators be classified? Can they be measured? Can this be done (semi-)automatically?
- (RQ3) Is there a framework for determining country-specific HVD? In other words, is it possible to determine what datasets are of particular value and interest for their further reuse and value creation, taking into account the specificities of the country under consideration, e.g., culture, geography, ethnicity, likelihood of crises and/or catastrophes?
Although neither OGD nor the importance of the value of data is a new topic, scholarly publications dedicated to HVD are still very limited. This points to the limited body of knowledge on this topic, which makes this study unique and constitutes a call for action. Nevertheless, during this study we have established a knowledge base on HVD determination-related aspects, including several definitions of HVD, data-related aspects, stakeholders, and some indicators and approaches that can now be used as a basis for a discussion of what a framework for determining HVD should look like. Along with the input we received from a series of international workshops with open (government) data experts, which covers more indicators and approaches found to be used in practice, this could enrich the common understanding of the goal, thereby contributing to the next open data wave (van Loenen & Šalamon, 2022).
Sounds interesting? Want to know more? Read the article here! Please cite the paper as: Nikiforova, A., Rizun, N., Ciesielska, M., Alexopoulos, C., Miletič, A. (2023). Towards High-Value Datasets determination for data-driven development: a systematic literature review. In: Lindgren, I., Csáki, C., Kalampokis, E., Janssen, M., Viale Pereira, G., Virkar, S., Tambouris, E., Zuiderwijk, A. (eds) Electronic Government. EGOV 2023. Lecture Notes in Computer Science. Springer, Cham.
- Nikiforova, A. (2021, October). Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia. In Proceedings of the 14th International Conference on Theory and Practice of Electronic Governance (pp. 367-372).
- Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information (recast)
- Commission Implementing Regulation (EU) 2023/138 of 21 December 2022 laying down a list of specific high-value datasets and the arrangements for their publication and re-use
- Huyer, E., & Blank, M. (2020). Analytical Report 15: High-value datasets: understanding the perspective of data providers. Publications Office of the European Union. doi:10.2830/363773
- Utamachant, P., & Anutariya, C. (2018, July). An analysis of high-value datasets: a case study of Thailand’s open government data. In 2018 15th international joint conference on computer science and software engineering (JCSSE) (pp. 1-6). IEEE
- van Loenen, B., & Šalamon, D. (2022). Trends and Prospects of Opening Data in Problem Driven Societies. Interdisciplinary Description of Complex Systems: INDECS, 20(2), II-IV