Our EOSC TF “FAIR Metrics and Data Quality” paper “Towards a data quality framework for EOSC” is released! 🍷🍷🍷

I am glad to announce the release of the “Towards a data quality framework for EOSC” document, which we have been hard at work on for several months as the Data Quality subgroup of the “FAIR Metrics and Data Quality” Task Force of the European Open Science Cloud (EOSC) Association – Carlo Lacagnina, Romain David, Anastasija Nikiforova, Mari Elisa Kuusniemi, Cinzia Cappiello, Oliver Biehlmaier, Louise Wright, Chris Schubert, Andrea Bertino, Hannes Thiemann, Richard Dennis.

This document explains basic concepts to build a solid basis for a mutual understanding of data quality in a multidisciplinary environment such as EOSC. These range from the difference between quality control, assurance, and management to categories of quality dimensions, as well as typical approaches and workflows to curate and disseminate dataset quality information, minimum requirements, indicators, certification, and vocabulary. These concepts are explored considering the importance of evaluating resources carefully when deciding on the sophistication of the quality assessments. Human resources, technology capabilities, and capacity-building plans constrain the design of sustainable solutions. Distilling the knowledge accumulated in this Task Force, we extracted cross-domain commonalities (each TF member brings their own experience and knowledge – we all represent different domains and therefore try to make our contributions domain-agnostic, while considering every nuance that our specialisms can bring and that deserves to be heard by others), as well as lessons learned and challenges.

The resulting main recommendations are:

  1. Data quality assessment needs standards to check data against; unfortunately, not all communities have agreed on standards, so EOSC should assist and push each community to agree on community standards to guarantee the FAIR exchange of research data. Although we extracted a few examples highlighting this gap, the current situation requires a more detailed and systematic evaluation in each community. Establishing a quality management function can help in this direction because the process can identify which standard, already in use by some initiatives, can be enforced as a general requirement for that community. We recommend that EOSC consider taking the opportunity to encourage communities to reach a consensus on using their standards.
  2. Data in EOSC need to be served with enough information for the user to understand how to read and correctly interpret the dataset, what restrictions are in place on its use, and what processes participated in its production. EOSC should ensure that the dataset is structured and documented in a way that allows it to be (re)used and understood. Quality assessments in EOSC should not be concerned with checking the soundness of the data content. Aspects like uncertainty are also important to properly (re)use a dataset; still, these aspects must be evaluated outside the EOSC ecosystem, which only checks that evidence about data content assessments is available. Following stakeholders’ expectations, we recommend that EOSC be equipped with essential data quality management, i.e., it should perform tasks like controlling the availability of basic metadata and documentation and performing basic metadata compliance checks (a minimal sketch of such a check is given after this list). The EOSC quality management should not change data but point to deficiencies that the data provider or producer can address.
  3. Errors found by the curators or users need to be rectified by the data producer/provider; if that is not possible, the errors need to be documented. Improving data quality as close to the source (i.e., producer or provider) as possible is highly recommended. Quality assessments conducted in EOSC should be shown first to the data provider, to give them a chance to improve the data, and only then to the users.
  4. User engagement is necessary to understand the user requirements (needs, expectations, etc.); it may or may not be part of a quality management function. Determining and evaluating stakeholder needs is not a one-time requirement but a continuous and collaborative part of the service delivery process.
  5. It is recommended to develop a proof-of-concept quality function performing basic quality assessments tailored to the EOSC needs (e.g., data reliability and usability). These assessments can also support rewarding the research teams most committed to providing FAIR datasets. The proof-of-concept function cannot be a theoretical conceptualization of what is preferable in terms of quality; rather, it must be constrained by the reality of dealing with an enormous amount of data within a reasonable time and workforce.
  6. Data quality is a concern for all stakeholders, as detailed further in this document. Quality assessment must be a multi-actor process involving the data provider, EOSC, and users, potentially extended to other actors in the long run. The resulting content of quality assessments should be captured in structured, human- and machine-readable, standard-based formats. Dataset information must be easily comparable across similar products, which calls for providing homogeneous quality information.
  7. A number of requirements valid for all datasets in EOSC (and beyond) have been defined, along with specific aspects of a maturity matrix gauging the maturity of a community in dealing with quality. Further refinement will be necessary in the future, and specific standards to follow will need to be identified.
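To make recommendation 2 a bit more concrete, here is a minimal sketch of what a basic metadata compliance check could look like. It is purely illustrative: the required fields and the `check_metadata` helper are my own assumptions, not part of any EOSC specification, and a real implementation would validate against community metadata standards (e.g., DataCite or Dublin Core schemas).

```python
# Illustrative sketch only: the field names below are assumptions, not an
# EOSC specification. In line with recommendation 2, the check never modifies
# the record; it only reports deficiencies for the provider or producer to address.

REQUIRED_FIELDS = ["identifier", "title", "description", "license", "contact"]

def check_metadata(record: dict) -> list:
    """Return a list of deficiencies found in a dataset's metadata record."""
    deficiencies = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            deficiencies.append(f"missing or empty metadata field: '{field}'")
    # A very basic compliance check: the identifier should be persistent (e.g., a DOI).
    identifier = str(record.get("identifier", ""))
    if identifier and not identifier.startswith(("doi:", "https://doi.org/")):
        deficiencies.append("identifier does not look like a persistent identifier")
    return deficiencies

if __name__ == "__main__":
    sample = {"title": "Ocean temperature, 2010-2020", "identifier": "local-123"}
    for issue in check_metadata(sample):
        print(issue)
```

Note how the function only points to deficiencies rather than repairing the record, reflecting the principle that quality improvement should happen as close to the source as possible (recommendation 3).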

We sincerely invite you to take a look at this very concise 76-page overview of the topic and look forward to your recommendations / suggestions / feedback – we hope to provide you with an opportunity to discuss the above conveniently very soon, so take your time to read while we make our last preparations 📖🍷📖🍷📖🍷 But make sure you have a glass of wine at hand while reading, as this will make sense at some point, i.e., when we compare data quality with wine quality with reference to flavour type and intensity (intrinsic quality) and brand and packaging (extrinsic quality)… but no more teasers, and bon appétit! 🍷🍷🍷
The document is available in Open Access here.

We also want to acknowledge the contribution and input of colleagues from several European institutions, the EOSC Association, and several external-to-TF stakeholders who gave feedback based on their own experience, as well as the TF Support Officer Paola Ronzino, our colleagues Sarah Stryeck and Raed Al-Zoubi, and, last but not least, all respondents and everyone involved.

14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K): how it was and who got the Best Paper Award?

In this post I would like to briefly elaborate on the truly insightful 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), where I was honored to participate as a speaker, presenting our paper “Putting FAIR principles in the context of research information: FAIRness for CRIS and CRIS for FAIRness” (authors: Otmane Azeroual, Joachim Schopfel, Janne Polonen, and Anastasija Nikiforova), and as a chair of two absolutely amazing sessions, where live and fruitful discussions took place – a real indicator of their success! And spoiler: our paper was recognized as the Best Paper! (i.e., the best paper award goes to… :))

IC3K consists of three subconferences, namely the 14th International Conference on Knowledge Discovery and Information Retrieval (KDIR), the 14th International Conference on Knowledge Engineering and Ontology Development (KEOD), and the 14th International Conference on Knowledge Management and Information Systems (KMIS), the latter being the one to which our paper was accepted and where it won the Best Paper Award – I know, this is a repetition, but I am glad to receive it, and the euroCRIS community is proud of us as well – read more here!

Briefly about our study, with which we mostly wanted to issue a call for action in the area of CRIS and their FAIRness. Of course, this is all about digitization, which takes place in various domains, including but not limited to the research domain, where it refers to the increasing integration and analysis of research information as part of the research data management process. However, it is not clear whether this research information is actually used and, more importantly, whether this information and data are of sufficient quality such that value and knowledge can be extracted from them. The FAIR principles (Findability, Accessibility, Interoperability, Reusability) are considered a promising asset to achieve this. Since their publication (by one of the colleagues I work with in the European Open Science Cloud), they have rapidly proliferated and become part of both national and international research funding programs. A special feature of the FAIR principles is the emphasis on the legibility, readability, and understandability of data. At the same time, they pose prerequisites for data reliability, trustworthiness, and quality.

In this sense, the subject of our study is the importance of applying FAIR principles to research information and the respective systems, such as Current Research Information Systems (CRIS, also known as RIS or RIMS) – an underrepresented research topic. What should be kept in mind is that research information is not just research data, and research information management systems such as CRIS are not just repositories for research data. They are much more complex, alive, dynamic, interactive, multi-stakeholder objects. However, in the real world they are not directly subject to the FAIR research data management guiding principles. Thus, supporting the call for a “one-stop-shop and register-once use-many approach”, we argue that CRIS are a key component of the research infrastructure landscape / ecosystem, directly targeted and enabled by the operational application and promotion of FAIR principles. We hypothesize that the improvement of FAIRness is a bidirectional process, where CRIS promote the FAIRness of data and infrastructures, and FAIR principles push further improvements to the underlying CRIS. All in all, the three propositions on which we elaborate in our paper, and which we invite everyone representing this domain to think about, are:

1. research information management systems (CRIS) are helpful to assess the FAIRness of research data and data repositories;

2. research information management systems (CRIS) contribute to the FAIRness of other research infrastructure;

3. research information management systems (CRIS) can be improved through the application of the FAIR principles.

Here, we have raised a discussion on this topic, showing that the improvement of FAIRness is a dual or bidirectional process, where CRIS promote and contribute to the FAIRness of data and infrastructures, and FAIR principles push for further improvement of the underlying CRIS data model and format, positively affecting the sustainability of these systems and the underlying artifacts. CRIS are beneficial for FAIR, and FAIR is beneficial for CRIS. Nevertheless, as pointed out by Tatum and Brown (2018), the impact of CRIS on FAIRness is mainly focused on (1) findability (“F” in FAIR), through the use of persistent identifiers, and (2) interoperability (“I” in FAIR), through standard metadata, while the impact on the other two principles, namely accessibility and reusability (“A” and “R” in FAIR), seems to be more indirect, related to and conditioned by metadata on licensing and access. Paraphrasing the statement that “FAIRness is necessary, but not sufficient for ‘open’” (Tatum and Brown, 2018), our conclusion is that “CRIS are necessary but not sufficient for FAIRness”.
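To illustrate that asymmetry, here is a toy sketch of such a screening. It is a sketch under my own assumptions – the field names (`doi`, `metadata_standard`, `access_conditions`, `license`) and the pass/fail logic are invented for illustration and do not come from our paper or from any CRIS standard such as CERIF: findability and interoperability lend themselves to fairly direct checks, while accessibility and reusability can only be inferred indirectly from access and licensing metadata.

```python
# Hypothetical CRIS record fields, for illustration only; a real CRIS would
# follow a standard data model such as CERIF.

def screen_fairness(record: dict) -> dict:
    """Rough, illustrative FAIRness screening of a single CRIS record."""
    report = {}
    # Findability: fairly direct to check via a persistent identifier (e.g., DOI).
    report["F"] = "ok" if record.get("doi") else "no persistent identifier"
    # Interoperability: fairly direct to check via use of a standard metadata format.
    report["I"] = "ok" if record.get("metadata_standard") else "no standard metadata"
    # Accessibility and reusability: only indirect signals, conditioned by
    # metadata on access conditions and licensing.
    report["A"] = "indirect (access conditions declared)" if record.get("access_conditions") else "unknown"
    report["R"] = "indirect (licence declared)" if record.get("license") else "unknown"
    return report

print(screen_fairness({"doi": "10.1234/example", "metadata_standard": "CERIF"}))
```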

This study differs significantly from what I typically talk about, but I was glad to contribute to it, thereby sharing the experience I am gaining in the European Open Science Cloud (EOSC) and the respective Task Force I am involved in – “FAIR Metrics and Data Quality”. It also allowed me to provide some insights into what we are dealing with within this domain and how our activities contribute to the currently limited body of knowledge on this topic.

A bit about the sessions I chaired and the topics raised within them, which were very diverse but equally relevant and interesting. I was kindly invited to chair two sessions, namely “Big Data and Analytics” and “Knowledge Management Strategies and Implementations”, where papers on the following topics were presented:

  • Decision Support for Production Control based on Machine Learning by Simulation-generated Data (Konstantin Muehlbauer, Lukas Rissmann, Sebastian Meissner, Landshut University of Applied Sciences, Germany);
  • Exploring the Test Driven Development of a Fraud Detection Application using the Google Cloud Platform (Daniel Staegemann, Matthias Volk, Maneendra Perera, Klaus Turowski, Otto-von-Guericke University Magdeburg, Germany) – this paper was also recognized as the best student paper;
  • Decision Making with Clustered Majority Judgment (Emanuele D’ajello, Davide Formica, Elio Masciari, Gaia Mattia, Arianna Anniciello, Cristina Moscariello, Stefano Quintarelli, Davide Zaccarella, University of Napoli Federico II, Copernicani, Milano, Italy);
  • Virtual Reality (VR) Technology Integration in the Training Environment Leads to Behaviour Change (Amy Rosellini, University of North Texas, USA);
  • Innovation in Boutique Hotels in Valletta, Malta: A Multi-level Investigation (Kristina, University of Malta, Malta).

And, of course, as is the case for each and every conference, the keynotes and panels are what gather the highest number of attendees, which is obvious considering the topics they elaborate on, raise, and discuss. IC3K is not an exception, and the conference started with a very insightful panel discussion on current data security regulations and whether they serve or rather restrict the application of the tools and techniques of AI. Each of the three speakers, namely Catholijn Jonker, Bart Verheijen, and Giancarlo Guizzardi, presented their views considering the domain they represent. As a result, the views were very different, but each at the same time left you with an “I cannot agree more” feeling!

One of the panelists, Catholijn Jonker (TU Delft), then delivered an absolutely exceptional keynote speech on Self-Reflective Hybrid Intelligence: Combining Human with Artificial Intelligence and Logic. I enjoyed not only the content but also the style, in which the propositions are critically elaborated, pointing out that they are not intended to serve as a silver bullet and that their scope, as well as side effects, should be determined and considered. A truly insightful and, I would say, inspiring talk.

All in all, thank you, organizers – INSTICC (Institute for Systems and Technologies of Information, Control and Communication), for bringing us together!