Data Quality

The topic of data quality has developed over the years and resulted in a series of articles, as well as a doctoral thesis (my Master's thesis, defended with distinction and recognized by ZIBIT as the best Master's Thesis in Computer Science of the year, was also devoted to it), in which a user-centered, data object-driven approach to data quality assessment was developed and applied to open data, with open government data (OGD) in particular serving as the domain of application.

The proposed approach served as an input for several projects, including DQMBT, a data quality model-based testing approach for information systems, which was developed and applied in real-life projects where an e-scooter system was the central object, thereby demonstrating the proposed approach in the context of the IoT, or more specifically the Internet of Vehicles (IoV). This example has since served as an input to the current study on car-sharing services in Latvia, more specifically on the optimization of their processes (read more…).

Later, data quality analysis became an integral part of my activities, including my experience with the Latvian Biomedical Research and Study Centre (BBMRI-ERIC Latvian National Node), where I inspected the current data ecosystem of the Latvian Biomedical Research and Study Centre and its related data artifacts, including the Latvian Genome Database. This resulted in a set of guidelines towards efficient data management for heterogeneous data holders and exchangers, developed as a deliverable of the HORIZON2020 INTEGROMED project (Deliverable 2.1, "Guidelines for the maintenance of efficient biobank, health register and research associated data") and presented during the European Biobank Week 2021 (read more about it here…). It also serves as an input for DECIDE – Development of a dynamic informed consent system for biobank and citizen science data management, quality control and integration.

In addition, as part of the digitization of processes running within the LBMC, I developed and introduced a set of e-surveys to replace paper-based surveys and questionnaires, which were typically completed by hand by patients, donors, or doctors and then entered into the database by manually rewriting the answers from the paper forms. As a result, the data tended to be incomplete or non-compliant with the database design, and inaccuracies arose when a person re-entered the data from a paper-based survey or interpreted an answer to make it fit the schema. E-surveys ensure data integrity through built-in data quality checks and compliance with the actual database design, reducing the number of errors made during data collection and insertion.
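To make the idea of built-in data quality checks concrete, here is a minimal sketch of the kind of validation an e-survey can apply at entry time, before anything reaches the database. The field names and rules are hypothetical illustrations, not the actual LBMC survey schema.

```python
# Hypothetical e-survey validation: each rule mirrors a constraint of the
# target database, so defects are caught at collection time rather than
# after manual re-entry. Field names and ranges are illustrative only.

REQUIRED_FIELDS = {"participant_id", "birth_year", "consent"}
ALLOWED_CONSENT = {"yes", "no"}

def validate_response(response: dict) -> list[str]:
    """Return a list of data quality issues found in one survey response."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not response.get(field):
            issues.append(f"missing required field: {field}")
    # Validity: values must conform to the database design.
    if "birth_year" in response:
        year = response["birth_year"]
        if not (isinstance(year, int) and 1900 <= year <= 2021):
            issues.append("birth_year out of expected range")
    consent = response.get("consent")
    if consent and consent not in ALLOWED_CONSENT:
        issues.append("consent must be 'yes' or 'no'")
    return issues
```

A response is accepted only when the returned list is empty; otherwise the respondent is prompted to correct the flagged answers, which is exactly what a paper form cannot do.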

In September 2021, I joined the European Open Science Cloud (EOSC) Association Task Force "FAIR Metrics and Data Quality", which oversees the implementation of FAIR metrics for the EOSC, testing them with research communities to ensure they are fit for purpose.

Source: European Open Science Cloud (EOSC)

Very briefly, on the data object-driven approach to data quality evaluation mentioned above: it consists of three main components, (1) a data object, (2) data quality requirements, and (3) a data quality evaluation process. As data quality is relative in nature, the data object and the quality requirements are (a) use-case dependent and (b) defined by the user in accordance with their needs. All three components of the presented data quality model are described using graphical Domain-Specific Languages (DSLs). The data quality model is executable, enabling scanning of the data object and detection of data quality defects and anomalies.
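The three components above can be sketched in a few lines of code. This is only an illustration of the idea, assuming a tabular data object and predicate-style requirements; in the actual approach the components are specified with graphical DSLs, and all names here are invented for the example.

```python
# Sketch of the three components of the data object-driven approach:
# (1) a data object, (2) user-defined data quality requirements, and
# (3) an evaluation process that scans the object and reports defects.
from typing import Callable

# (1) The data object: the subset of data relevant to the use case,
# here two hypothetical open-data catalogue records.
data_object = [
    {"title": "Dataset A", "updated": "2021-03-01"},
    {"title": "", "updated": None},
]

# (2) Data quality requirements: use-case dependent checks defined by
# the user, stated here as named predicates over a single record.
requirements: dict[str, Callable[[dict], bool]] = {
    "title must be non-empty": lambda r: bool(r.get("title")),
    "update date must be present": lambda r: r.get("updated") is not None,
}

# (3) The evaluation process: scan every record against every
# requirement and collect the detected defects.
def evaluate(data: list[dict], reqs: dict) -> list[tuple[int, str]]:
    """Return (record index, violated requirement) pairs."""
    return [(i, name)
            for i, record in enumerate(data)
            for name, check in reqs.items()
            if not check(record)]
```

Running `evaluate(data_object, requirements)` flags the second record twice, once per violated requirement, which corresponds to the "executable model" scanning the data object for defects and anomalies.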

The list of articles (both journal articles and conference papers) is presented below.

I have also delivered some talks on this and related topics, including:

In addition, in 2021 I was honored to deliver two guest lectures, one of them closely related to data quality. It was given to the students of the University of South-Eastern Norway (USN) and focused mainly on open data – its ecosystem, barriers, and current and future trends, both worldwide and in the Norwegian context (see slides here) – with a strong emphasis on data quality, which the audience found especially interesting (read more…).