HackCodeX Forum Keynote “Data Quality as a prerequisite for your business success: when should I start taking care of it?”

On June 5, I was delighted to be invited as a keynote speaker at the HackCodeX Forum, delivering a keynote titled “Data Quality as a prerequisite for your business success: when should I start taking care of it?” in my hometown – Riga, Latvia. HackCodeX Forum is a one-day event where international experts share their experience and knowledge about emerging technologies and areas such as Artificial Intelligence, Security, Data Quality, Quantum Computing, Sustainability, Open Data, Privacy, Ethics, Digital Services (with a keynote from the CEO of SK ID Solutions – one of the solutions that make Estonia the #1 digital nation), etc. This time I was invited to cover the topic of Data Quality, and I was happy to do so, especially considering that the HackCodeX Forum closes one of the leading hackathons in Europe – one that Riga was fascinated by and passionate about, as evidenced by the rich coverage we all saw over the last weeks and months (Delfi, Haker.lv, kripto.media, kursors.lv, labsoflatvia.lv, to name just a few). This year the hackathon was held in Latvia and brought together around 500 developers, designers and entrepreneurs to create and innovate, solving this year’s 5 challenges:

  • 🏆 ATEA challenge: Minimise manual work and drive data-powered decision-making
  • 🏆 Emergn challenge: Improve the quality of life for people with disabilities
  • 🏆 UI.COM & Riga TechGirls challenge: Help shoppers make more sustainable purchasing decisions 
  • 🏆 Game Changer Audio (GCA) challenge: Identify each individual note by listening to notes being played in real time
  • 🏆 Ministry of Education and Science challenge: Help make education hackable again!

For me, in turn, yet another audience, yet another experience.

In short, in this Star Wars-style presentation (yes, I am a fan, and given the number of DQ memes in this style, I am not an exception – not a geek or a weird person, but rather a normal DQ/IT person), I urged the audience to “help R2D2 save the galaxy!”.

Images from: History in Objects: Death Star Plans Datacard • Lucasfilm, Video Analysis of an Exploding Death Star | WIRED, Post | LinkedIn, Destruction of Despayre | Wookieepedia | Fandom. Special thanks to George Firican for the idea and inspiration!

In a bit more detail, I elaborated on the importance and relevance of data quality regardless of the age of the topic [which is older than me], on data quality management, and on the factors the DQM approach depends on. The popularity and importance of the topic is undoubtedly due to the amount of data we are dealing with and the fact that we live in a data-driven world, where data are everywhere – generated continuously, by multiple sources, which is not only about our devices or sensors, but also about ourselves (however, with the help of the two above). This led to the claim, made some time ago, that data are the new oil. Have you heard this? I am sure you have. But have you thought about this statement? Is it true? False? Something in between? Bingo! While there are commonalities between data and oil, they are rather small in number. One interesting reading devoted to this comes from Forbes. They admit that both artifacts – oil and data – can be seen as similar since both are “power”, including being the power of those who own them. In other words, they compare data owners such as Alibaba, Google, Twitter, Facebook, etc. to the oil barons of 100 years ago. Otherwise, however, a more in-depth comparative analysis reveals mostly differences. To name just a few:

💡 oil is a finite resource, while data are not. Instead, data are effectively infinitely durable and reusable, and treating them like oil, i.e. storing them in silos, reduces their value, usefulness and potential as a whole;

💡 another difference is in transportation: oil requires huge amounts of resources to be transported to where and when it is needed, while data can be replicated indefinitely and moved around the world at very high speed and, more importantly, at very low cost;

💡 yet another difference lies in the usability of both – oil and data – once they have already been used. When oil is used, its energy is lost (as heat or light) or permanently converted into another form such as plastic, whereas the usefulness of data, in contrast, tends to increase with actual usage, i.e. new uses arise, data are turned into training data at the very end, etc.;

💡 as the world’s oil reserves dwindle, extracting oil becomes increasingly difficult and expensive, while data are becoming increasingly available, due to, among other things, technological advances and the growing number and output of data producers;

💡 and last but not least, oil drilling damages the natural environment and exploits finite natural resources, while data mining does not – at least there is no intrinsic damage to the environment or exploitation of finite natural resources. Of course, here we do not mention (but should not forget about) the electricity used to run the systems and the still relatively low uptake of green computing (aka sustainable computing) for their further processing.

Thus, as Forbes suggests, if we want to talk about data as a power source or fuel, it makes much more sense to compare them with renewable sources 🌎🌎🌎 such as the sun ☀️, wind 💨 and tides 🌊. All in all, data can be seen to be more than oil. Hence the popularity and importance of the data quality topic.

The factors that can affect the DQM approach, in turn, can be different, starting with those stemming from the relative nature of data quality as a phenomenon, i.e., its definition, the variety (and ambiguity) of data quality dimensions, against which data quality metrics are expected to be selected, the dynamism of DQ, its dependence on the user and the use case, etc. (some of the above are discussed in Towards a data quality framework for EOSC and “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment”), and ending with the data artifact whose quality is under analysis. In other words, is it a data object or a dataset? A database? A data repository? An information system?
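To make the relativity of dimensions and metrics a bit more tangible, here is a minimal, purely illustrative Python sketch (not taken from the keynote): two common dimensions – completeness and uniqueness – computed for a hypothetical customer table. Which dimensions, metrics and thresholds actually matter depends on the user and the use case.

```python
# Purely illustrative sketch: two data quality metrics for a hypothetical
# customer table. Dimension choice and thresholds are use-case dependent.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "c@example.com"],
})

# Completeness: share of non-missing values per column
completeness = customers.notna().mean()

# Uniqueness: share of distinct values in the identifier column
uniqueness = customers["customer_id"].nunique() / len(customers)

print(completeness.to_dict())   # {'customer_id': 1.0, 'email': 0.75}
print(round(uniqueness, 2))     # 0.75 – the duplicated customer_id hints at a problem
```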

If it is a data object, the next “level” of factors is the data owner – known or unknown (third-party data such as open data) – and the structure of the data: is it structured, semi-structured, or unstructured?

For Information Systems / Software, in turn, I find that “think data quality first” and “data quality by design” are two mantras to be kept in mind. The latter, however, is something we have studied together with my colleagues from Mexico, coming up with this modification of the “quality by design” principle into “data quality by design”. I reported on the respective study before – “ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards data quality by design” (read here), where we proposed DAQUAVORD – a Methodology for Project Management of Data Quality Requirements Specification, which is based on the Viewpoint-Oriented Requirements Definition (VORD) method and the latest and most generally accepted ISO/IEC 25012 standard. Its main idea is to start thinking of data quality as soon as the development of the system starts, to make sure that a certain data quality level is ensured by design, i.e. transformed into both functional and non-functional requirements.

Alternatively, this can be done not necessarily before, but also during the development, or even when the system is already in production. Some solutions exist here, but I typically use the opportunity to self-advertise previous projects and studies I worked on, especially this one, since it was based on the results of my PhD thesis, summarized in “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment” – namely, the Data Quality Model-based testing approach (DQMBT) for testing information systems, which uses the data object-driven data quality model as a testing model and was presented in the context of an e-scooter system and an insurance system. Both, however, are rather ad-hoc approaches, whose main value lies in the conceptual idea, not the implementation, at least at this point.
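To give a flavour of the conceptual idea (this is my own minimal sketch, not the actual DQMBT implementation), one can think of the data quality model as a set of rules attached to a data object – below a hypothetical e-scooter ride record with made-up field names – and of the test as a simple evaluation of the data produced by the system against those rules:

```python
# Minimal illustrative sketch of the conceptual idea (not the actual DQMBT implementation):
# a data object is paired with quality rules, and system output is evaluated against them.
from datetime import datetime

# Quality rules for a hypothetical "ride" data object (names are made up)
RIDE_RULES = {
    "id_present":      lambda r: r.get("ride_id") is not None,
    "valid_duration":  lambda r: r["end_time"] > r["start_time"],
    "plausible_price": lambda r: 0 <= r.get("price_eur", -1) <= 100,
}

def check_ride(ride: dict) -> dict:
    """Evaluate one ride record against the quality model; True means the rule holds."""
    return {name: rule(ride) for name, rule in RIDE_RULES.items()}

ride = {
    "ride_id": "r-001",
    "start_time": datetime(2023, 6, 5, 10, 0),
    "end_time": datetime(2023, 6, 5, 9, 50),   # ends before it starts
    "price_eur": 3.40,
}

failed = [name for name, ok in check_ride(ride).items() if not ok]
print(failed)   # ['valid_duration'] – this record would fail the quality test
```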

For a repository, in turn, is it a data warehouse or a data lake? Or maybe even a data lakehouse? For the latter two, metadata and data governance become a “must” to avoid the GIGO (garbage in – garbage out) effect and turning the data lake into a data swamp, which is slightly addressed in “Combining data lake and data wrangling for ensuring data quality in CRIS“, incl. but not limited to elaborating on why data wrangling should be given preference over data cleaning.
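As a small, hedged illustration of why metadata becomes a “must” in a lake-like setting (my own sketch, not taken from the referenced paper), even a minimal ingestion step that records where a dataset came from, who owns it, when it landed and what exactly landed already helps to keep the lake navigable rather than letting it turn into a swamp:

```python
# Hedged sketch: register minimal metadata at ingestion time so that datasets
# in the lake remain discoverable and traceable (names and fields are hypothetical).
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(path: str, source: str, owner: str, catalog: str = "catalog.jsonl") -> dict:
    raw = Path(path).read_bytes()
    entry = {
        "dataset": Path(path).name,
        "source": source,                              # where the data came from
        "owner": owner,                                # who is accountable for it
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw),
        "checksum": hashlib.sha256(raw).hexdigest(),   # detect silent changes later
    }
    with open(catalog, "a", encoding="utf-8") as f:    # append to a simple metadata catalog
        f.write(json.dumps(entry) + "\n")
    return entry

# Example (hypothetical file and team names):
# entry = ingest("rides_2023-06.csv", source="e-scooter API", owner="mobility team")
```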

The importance of both metadata and data governance was then emphasized, where for the latter the support of Elon Musk was enlisted 😀 He was mentioned rather to back up the point about the importance of data governance, which he once named as a key to improving the product you are delivering, and I just wanted to make my words a bit more authoritative – he is seen as a more or less successful businessman, isn’t he? 😀

You can find slides here or watch the video 👇

Big thanks to both the organizers – Helve – and the supporters, who made both the hackathon and the forum a success. More precisely, Techchill, techhub, Lift 99, #RigaTechGirls, justjoin.it, Oradea.Tech.Hub, RTU Design Factory, Startup Lithuania, Kaunas Technology University, Startup Estonia, Spring Hub, kood / Johvi, Technopol, Enterprise Forum CEE, Slush, Aaltos, AWS (Amazon Web Services), Google for Startups, Junction, Bird Incubator, EdTech Estonia, Sphere.it, Codecamp, Nine brains, Draper Startup House, Eiropas Digitālās inovācijas centrs, 28Stone.

And some more very special actors of the community, who were at the core of this hackathon edition – Emergn, Izglītības un zinātnes ministrija (Ministry of Education and Science), EPAM Systems Latvia, Atea Global Services Ltd., Ubiquiti Inc. & RigaTechGirls, and the Investment and Development Agency of Latvia (LIAA).

Towards data quality by design – ISO/IEC 25012-based methodology for managing DQ requirements in the development of IS – one of the most downloaded articles of DKE as of July 2023

It is obvious that users should trust the data that are managed by the software applications constituting Information Systems (IS). This means that organizations should ensure an appropriate level of quality of the data they manage in their IS. Therefore, an adequate level of quality of the data to be managed by an IS must be an essential requirement for every organization. Many advances have been made in recent years in software quality management, both at the process and product level. This is also supported by the fact that a number of global standards have been developed and adopted, addressing specific issues through quality models (ISO 25000, ISO 9126), process maturity models (ISO 15504, CMMI), and standards focused mainly on software verification and validation (ISO 12207, IEEE 1028, etc.). These standards have been applied worldwide for over 15 years.

However, awareness of software quality depends on other variables, such as the quality of the information and data managed by the application. This is recognized by the SQuaRE standards (ISO/IEC 25000), which highlight the need to deal with data quality as part of the assessment of the quality level of a software product, according to which “the target computer system also includes computer hardware, non-target software products, non-target data, and the target data, which is the subject of the data quality model”. This means that organizations should take data quality concerns into account when developing software, as data is a key factor. To this end, we stress that such data quality concerns should be considered at the initial stages of software development, following the “data quality by design” principle (the “quality by design” principle is referenced relatively often, while “data quality”, as a subset of the “quality” concept applied to data / information artifacts, receives significantly more limited interest, if any).

The “data quality” concept is considered to be multidimensional and largely context dependent. For this reason, managing specific data quality requirements is a difficult task. Thus, the main objective of our new paper titled “ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards data quality by design” is to present a methodology for Project Management of Data Quality Requirements Specification called DAQUAVORD, aimed at eliciting DQ requirements arising from different users’ viewpoints. These specific requirements should serve as typical requirements, both functional and non-functional, at the time of developing an IS that takes Data Quality into account by default, leading to smarter and more collaborative development.

In a bit more detail, we introduce the concept of a Data Quality Software Requirement as a means of implementing a Data Quality Requirement in an application. A Data Quality Software Requirement is described as a software requirement aimed at satisfying a Data Quality Requirement. The justification for this concept lies in the fact that we want to capture the Data Quality Requirements that best match the data used by a user in each usage scenario, and later derive the consequent Data Quality Software Requirements that will complement the normal software requirements linked to each of those scenarios. Addressing multiple Data Quality Software Requirements is indisputably a complex process, taking into account the existence of strong dependencies such as internal constraints and interactions with external systems, as well as the diversity of users. As a result, they tend to impact and show the consequences of contradictory overlaps on both process and data models.
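As a purely hypothetical illustration (not an excerpt from the paper, and with made-up names), the chain of concepts can be pictured like this: a Data Quality Requirement raised by one viewpoint is turned into a Data Quality Software Requirement that sits next to the ordinary software requirements of the same usage scenario:

```python
# Hypothetical illustration (not from the paper): one viewpoint's DQ requirement
# and the DQ software requirement derived from it for a given usage scenario.
from dataclasses import dataclass

@dataclass
class DataQualityRequirement:
    viewpoint: str          # the user role raising the requirement
    data_item: str
    dimension: str          # e.g. an ISO/IEC 25012 characteristic
    statement: str

@dataclass
class DataQualitySoftwareRequirement:
    scenario: str
    derived_from: DataQualityRequirement
    statement: str          # what the software must do to satisfy the DQ requirement

dqr = DataQualityRequirement(
    viewpoint="claims adjuster",
    data_item="policy holder e-mail",
    dimension="completeness",
    statement="Every policy record must contain a contact e-mail address.",
)

dqsr = DataQualitySoftwareRequirement(
    scenario="register new policy",
    derived_from=dqr,
    statement="The registration form shall reject submission when the e-mail field is empty.",
)
```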

In view of such complexity, and attempting to improve the development effort, we introduce DAQUAVORD, a Methodology for Project Management of Data Quality Requirements Specification, which is based on the Viewpoint-Oriented Requirements Definition (VORD) method and the latest and most generally accepted ISO/IEC 25012 standard. It is universal and easily adaptable to different information systems in terms of their nature, the number and variety of actors, and other aspects. The paper presents both the concept of the proposed methodology and an example of its application, which serves as a kind of step-by-step manual on how to use it to achieve smarter software development with data quality by design. The paper is a continuation of our previous study and establishes the following research questions (RQs):

RQ1: What is the state of the art regarding the “data quality by design” principle in the area of software development? What are the current approaches (if any) to data quality management during the development of IS?

RQ2: How should the concepts of Data Quality Requirements (DQR) and the Viewpoint-Oriented Requirements Definition (VORD) method be defined and implemented in order to promote the “data quality by design” principle?

Sounds interesting? Read the full text of the article published in Elsevier Data & Knowledge Engineering – here.

The first comprehensive approach to this problem is presented in this paper, setting out a methodology for project management of the data quality requirements specification. Given the relative nature of the “data quality” concept and the ongoing discussions on a universal view of data quality dimensions, we have based our proposal on the latest and most generally accepted ISO/IEC 25012 standard, thus seeking to achieve a better integration of this methodology with the documentation, systems and projects already existing in the organization. We expect that this methodology will help Information System developers to plan and execute a proper elicitation and specification of specific data quality requirements expressed by the different roles (viewpoints) that interact with the application. It can serve as a guide that analysts can follow when writing a Requirements Specification Document supplemented with Data Quality management. The identification and classification of data quality requirements at the initial stage makes it easier for developers to be aware of the data quality to be implemented for each function throughout the entire development process of the application.

As future work, we plan to consider the advantages provided by the Model Driven Architecture (MDA), focusing mainly on its abstraction and modelling capabilities. This will make it much easier to integrate our results into the development of “Data Quality aware Information Systems” (DQ-aware-IS) with other software development methodologies and tools. This, however, is expected to expand the scope of the developed methodology and to consider various features related to data quality, including the development of a conceptual measure of data value, i.e., intrinsic value, as proposed in.

UPDATE: In July 2023 it also became one of the most downloaded articles from Data & Knowledge Engineering (Elsevier) in the last 90 days – have not read it yet? Take a look, it is waiting for you 😉

César Guerra-García, Anastasija Nikiforova, Samantha Jiménez, Héctor G. Perez-Gonzalez, Marco Ramírez-Torres, Luis Ontañon-García, ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards Data Quality by Design, Data & Knowledge Engineering, 2023, 102152, ISSN 0169-023X, https://doi.org/10.1016/j.datak.2023.102152