I am glad to announce the release of the “Towards a data quality framework for EOSC” document, which we have been hard at work on for several months as the Data Quality subgroup of the “FAIR Metrics and Data Quality” Task Force of the European Open Science Cloud (EOSC) Association: Carlo Lacagnina, Romain David, Anastasija Nikiforova, Mari Elisa Kuusniemi, Cinzia Cappiello, Oliver Biehlmaier, Louise Wright, Chris Schubert, Andrea Bertino, Hannes Thiemann, Richard Dennis.
This document explains basic concepts to build a solid basis for a mutual understanding of data quality in a multidisciplinary environment such as EOSC. These range from the difference between quality control, assurance, and management to categories of quality dimensions, as well as typical approaches and workflows to curate and disseminate dataset quality information, minimum requirements, indicators, certification, and vocabulary. These concepts are explored considering the importance of evaluating resources carefully when deciding the sophistication of the quality assessments. Human resources, technology capabilities, and capacity-building plans constrain the design of sustainable solutions. Distilling the knowledge accumulated in this Task Force, we extracted cross-domain commonalities, as well as lessons learned and challenges. Each TF member brings their own experience and knowledge: we all represent different domains, so we tried to keep our contributions domain-agnostic while still considering every nuance that our specialisms can bring and that deserves to be heard by others.
The resulting main recommendations are:
- Data quality assessment needs standards to check data against; unfortunately, not all communities have agreed on standards, so EOSC should assist and push each community to agree on community standards to guarantee the FAIR exchange of research data. Although we extracted a few examples highlighting this gap, the current situation requires a more detailed and systematic evaluation in each community. Establishing a quality management function can help in this direction because the process can identify which standard already in use by some initiatives can be enforced as a general requirement for that community. We recommend that EOSC considers taking the opportunity to encourage communities to reach a consensus in using their standards.
- Data in EOSC need to be served with enough information for the user to understand how to read and correctly interpret the dataset, what restrictions are in place to use it, and what processes participate in its production. EOSC should ensure that the dataset is structured and documented in a way that can be (re)used and understood. Quality assessments in EOSC should not be concerned with checking the soundness of the data content. Aspects like uncertainty are also important to properly (re)use a dataset. Still, these aspects must be evaluated outside the EOSC ecosystem, which only checks that evidence about data content assessments is available. Following stakeholders’ expectations, we recommend that EOSC is equipped with essential data quality management, i.e., it should perform tasks like controlling the availability of basic metadata and documentation and performing basic metadata compliance checks. The EOSC quality management should not change data but point to deficiencies that the data provider or producer can address.
- Errors found by the curators or users need to be rectified by the data producer/provider. If not possible, errors need to be documented. Improving data quality as close to the source (i.e., producer or provider) as possible is highly recommended. Quality assessments conducted in EOSC should be shown first to the data provider to give a chance to improve the data and then to the users.
- User engagement is necessary to understand the user requirements (needs, expectations, etc.); it may or may not be part of a quality management function. Determining and evaluating stakeholder needs is not a one-time requirement but a continuous and collaborative part of the service delivery process.
- It is recommended to develop a proof-of-concept quality function performing basic quality assessments tailored to the EOSC needs (e.g., data reliability and usability). These assessments can also support rewarding research teams most committed to providing FAIR datasets. The proof-of-concept function cannot be merely a theoretical conceptualization of what is preferable in terms of quality; it must be constrained by the reality of dealing with an enormous amount of data within a reasonable time and workforce.
- Data quality is a concern for all stakeholders, detailed further in this document. The quality assessments must be a multi-actor process between the data provider, EOSC, and users, potentially extended to other actors in the long run. The resulting content of quality assessments should be captured in structured, human- and machine-readable, and standard-based formats. Dataset information must be easily comparable across similar products, which calls for providing homogeneous quality information.
- A number of requirements valid for all datasets in EOSC (and beyond) and specific aspects of a maturity matrix gauging the maturity of a community when dealing with quality have been defined. Further refinement will be necessary in the future, and specific standards to follow will need to be identified.
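As a rough illustration of the "basic metadata availability" checks recommended above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the list of required fields, the record layout, and the function name `assess_metadata` are hypothetical and are not taken from the document. The sketch only reports deficiencies in a machine-readable form and never changes the record, in line with the recommendation that quality management should point to deficiencies rather than alter data.

```python
import json

# Hypothetical minimum metadata fields (an assumption for this sketch;
# the document does not prescribe this particular list).
REQUIRED_FIELDS = ("title", "creator", "license", "description", "provenance")


def assess_metadata(record: dict) -> dict:
    """Report which required fields are missing or empty, without altering the record."""
    missing = [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]
    return {
        "dataset_id": record.get("id"),
        "checked_fields": list(REQUIRED_FIELDS),
        "missing_fields": missing,
        "passes_basic_check": not missing,
    }


if __name__ == "__main__":
    # Made-up example record: the license is empty, and the description
    # and provenance fields are absent.
    example = {
        "id": "example-001",
        "title": "Sea surface temperature",
        "creator": "Example Lab",
        "license": "",
    }
    # The JSON report is readable by humans and machines alike and could be
    # shown to the data provider first, then to users.
    print(json.dumps(assess_metadata(example), indent=2))
```

A real quality function would check records against whichever community standard applies instead of a hard-coded field list.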
We sincerely invite you to take a look at this very concise 76-page overview of the topic and look forward to your recommendations / suggestions / feedback – we hope to provide you with the opportunity to communicate these conveniently very soon, so take your time to read while we make our last preparations 📖 🍷📖🍷📖🍷 But make sure you have a glass of wine at hand while reading, as this will make sense at some point, i.e. when we compare data quality with wine quality with reference to both flavour type and intensity (intrinsic quality) and brand and packaging (extrinsic quality)… but no more teasers, and bon appétit! 🍷🍷🍷
The document is available in Open Access here.
We also want to acknowledge the contribution and input of colleagues from several European institutions, the EOSC Association, and several external-to-TF stakeholders who gave feedback based on their own experience, and the TF Support Officer Paola Ronzino, as well as our colleagues Sarah Stryeck and Raed Al-Zoubi, and, last but not least, all respondents and everyone involved.