📢🚨⚠️Paper alert! Overlooked aspects of data governance: workflow framework for enterprise data deduplication

This time I would like to recommend a new paper, “Overlooked aspects of data governance: workflow framework for enterprise data deduplication”, which has just been presented at the IEEE-sponsored International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). This “just”, btw, means June 19, the day after my birthday, so I decided to start my new year with one more conference and paper. And yes, this means that what many of those who congratulated me were wishing for (to find time for myself, reach work-life balance etc.) is still something I have to try to achieve, since this time I again gave preference to my career over my personal life (what a surprise, isn’t it?) 🙂 Moreover, this is a conference where I am also part of the Steering Committee and the Technical Program Committee, and serve as publicity chair. During the conference, I also chaired its first session, which I consider a special honor. For me the session was very smooth, interactive and insightful, thanks, of course, to its participants, the authors and their studies, which allowed us to establish a fruitful discussion and get some insights for our further studies (yes, I also got one very useful idea for further investigation). Thank you to all contributors, with special thanks to Francisco Bonilla Rivas, Bruck Wubete, Reem Nassar, Haitham Al Ajmi.

And I am also proud of having secured one of the four keynotes for this conference: Prof. Eirini Ntoutsi from the Bundeswehr University Munich (UniBw-M), Germany, who delivered the keynote “Bias and Discrimination in AI Systems: From Single-Identity Dimensions to Multi-Discrimination”. I heard it at one of the previous conferences I attended and decided that it is a “must” for our conference as well, so I am super glad that Eirini accepted our invitation! I should immediately add that the other keynotes were excellent as well: Giancarlo Fortino (University of Calabria, Italy), Dofe Jaya (Computer Engineering Department, California State University, Fullerton, California, USA), and Sandra Sendra (Polytechnic University of Valencia, Spain).

The paper I presented was written by a team of three: Otmane Azeroual (German Centre for Higher Education Research and Science Studies (DZHW), Germany), myself, Anastasija Nikiforova (Faculty of Science and Technology, Institute of Computer Science, University of Tartu, Estonia & Task Force “FAIR Metrics and Data Quality”, European Open Science Cloud), and Kewei Sha (College of Science and Engineering, University of Houston Clear Lake, USA). A very international team.

So, what is the paper about? It is (or should be) clear that data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), product management, manufacturing, and marketing are used, duplicates occur, e.g., multiple entries for the same customer or product in a database or information system. There can be several reasons for this, including, but not limited to, the growing volume of data, driven among other things by the adoption of cloud technologies, the use of multiple different sources, and the proliferation of connected personal and work devices in homes, stores, offices and supply chains. The result of non-unique or duplicate records, however, is degraded data quality, which, in turn, ultimately leads to inaccurate analysis; poor, distorted or skewed decisions; and distorted insights provided by Business Intelligence (BI) or machine learning (ML) algorithms, models, forecasts, and simulations for which the data form the input. It also affects other data-driven activities, such as service personalisation, in terms of their accuracy, trustworthiness and reliability, user acceptance / adoption and satisfaction, customer service, risk management, crisis management, as well as resource management (time, human, and fiscal), not to mention wasted resources and employees who are less likely to trust the data and associated applications, thereby affecting the company image. This, in turn, can lead to the failure of a project, if not of a business.

At the same time, the amount of data that companies collect is growing exponentially, making it difficult to manage them effectively. Thus, both ex-ante and ex-post deduplication mechanisms are critical in this context to ensure sufficient data quality, and they are usually integrated into a broader data governance approach. In this paper, we develop such a conceptual data governance framework for effective and efficient management of duplicate data, and improvement of data accuracy and consistency in medium to large data ecosystems. We present methods and recommendations for companies to deal with duplicate data in a meaningful way, while the presented framework is integrated into one of the most popular data quality tools, Data Cleaner.

In short, in this paper we:

  • first, present methods for how companies can deal meaningfully with duplicate data. Initially, we focus on data profiling using several analysis methods applicable to different types of datasets, incl. analysis of different types of errors, structuring, harmonizing, & merging of duplicate data;
  • second, we propose methods for reducing the number of comparisons and matching attribute values based on similarity (in medium to large databases). The focus is on easy integration and duplicate detection configuration so that the solution can be easily adapted to different users in companies without domain knowledge. These methods are domain-independent and can be transferred to other application contexts to evaluate the quality, structure, and content of duplicate / repetitive data;
  • finally, we integrate the chosen methods into the framework of Hildebrandt et al. [ref 2]. We also explore some of the most common data quality tools in practice, into which we integrate this framework.

After that, we test and validate the framework. The final refined solution provides the basis for subsequent use. It consists of detecting and visualizing duplicates, presenting the identified redundancies to the user in a user-friendly manner to enable and facilitate their further elimination.
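
To make the second point in the list above a bit more tangible, here is a minimal Python sketch of the general idea: group records into blocks so that only records sharing a blocking key are compared, then match attribute values by similarity. It is not the implementation from the paper or from Data Cleaner. The sample records, the blocking key (city plus first letter of the name), the 0.85 threshold and the use of `difflib.SequenceMatcher` from the standard library as the similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records, as they might arrive from a CRM and an ERP system.
RECORDS = [
    {"id": 1, "name": "Anna Schmidt", "city": "Berlin"},
    {"id": 2, "name": "Ana Schmidt", "city": "Berlin"},
    {"id": 3, "name": "Peter Mueller", "city": "Hamburg"},
    {"id": 4, "name": "Petra Meier", "city": "Hamburg"},
    {"id": 5, "name": "anna  schmidt ", "city": "berlin"},
]


def normalize(value):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(value.lower().split())


def blocking_key(record):
    """Illustrative blocking key (an assumption): city plus first letter of the name."""
    return normalize(record["city"]) + "|" + normalize(record["name"])[:1]


def similarity(a, b):
    """String similarity in [0, 1], using difflib's ratio on normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return (id_a, id_b, score) for candidate pairs at or above the threshold."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    matches = []
    for block in blocks.values():
        # Only records that share a blocking key are compared with each other.
        for left, right in combinations(block, 2):
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                matches.append((left["id"], right["id"], round(score, 2)))
    return matches


if __name__ == "__main__":
    for pair in find_duplicates(RECORDS):
        print(pair)
```

With these five sample records, comparing every pair would take 10 comparisons, while blocking reduces this to 4 (three within the Berlin block and one within the Hamburg block). The price is that a poorly chosen blocking key can separate true duplicates, e.g. if the first letter of the name is mistyped, so the key should be chosen with the data in mind.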

With this paper we aim to support research in data management and data governance by identifying duplicate data at the enterprise level and meeting today’s demands for increased connectivity / interconnectedness, data ubiquity, and multi-data sourcing. In addition, the proposed conceptual data governance framework aims to provide an overview of data quality, accuracy and consistency to help practitioners approach data governance in a structured manner.

In general, what is needed are not only technological solutions that identify / detect poor-quality data and allow their examination and correction, or that prevent them by integrating controls into the system design, striving for “data quality by design” [ref3, ref4], but also cultural changes related to data management and governance within the organization. These two perspectives form the basis of a healthy business data ecosystem. Thus, the presented framework describes the hierarchy of people who are allowed to view and share data, rules for data collection, data privacy, data security standards, and channels through which data can be collected. Ultimately, this framework will help users be more consistent in data collection and data quality for reliable and accurate results of data-driven actions and activities.

Sounds interesting? Read the paper -> here (to be cited as: Azeroual, O., Nikiforova, A., Sha, K. (2023, June). Overlooked aspects of data governance: workflow framework for enterprise data deduplication. In 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). IEEE (in print))

The International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023) is co-located with the International Conference on Multimedia Computing, Networking and Applications (MCNA2023); both are sponsored by IEEE (IEEE Espana Seccion), Universitat Politecnica de Valencia, and Al Ain University. Great thanks to the organizers: Jaime Lloret (Universitat Politècnica de València, Spain), Yaser Jararweh (Jordan University of Science and Technology, Jordan), Marios C. Angelides (Brunel University London, UK), and Muhannad Quwaider (Jordan University of Science and Technology, Jordan).

References:

Azeroual, O., Nikiforova, A., Sha, K. (2023, June). Overlooked aspects of data governance: workflow framework for enterprise data deduplication. In 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). IEEE (in print).

Hildebrandt, K., Panse, F., Wilcke, N., & Ritter, N. (2017). Large-scale data pollution with Apache Spark. IEEE Transactions on Big Data, 6(2), 396-411.

Guerra-García, C., Nikiforova, A., Jiménez, S., Perez-Gonzalez, H. G., Ramírez-Torres, M., & Ontañon-García, L. (2023). ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards Data Quality by Design. Data & Knowledge Engineering, 145, 102152.

Corrales, D. C., Ledezma, A., & Corrales, J. C. (2016). A systematic review of data quality issues in knowledge discovery tasks. Revista Ingenierías Universidad de Medellín, 15(28), 125-150.

HackCodeX Forum Keynote “Data Quality as a prerequisite for your business success: when should I start taking care of it?”

On June 5, I was delighted to be invited as a keynote speaker at the HackCodeX Forum, delivering a keynote titled “Data Quality as a prerequisite for your business success: when should I start taking care of it?” in my hometown, Riga, Latvia. HackCodeX Forum is a one-day event where international experts share their experience and knowledge about emerging technologies and areas such as Artificial Intelligence, Security, Data Quality, Quantum Computing, Sustainability, Open Data, Privacy, Ethics, Digital Services (with a keynote from the CEO of SK ID Solutions, one of the solutions that make Estonia the #1 digital nation) etc. This time I was invited to cover the topic of Data Quality, and I was happy to do so, especially considering that the HackCodeX Forum closes one of the leading hackathons in Europe. Riga was fascinated by and passionate about it, as evidenced by the rich list of advertisements we all saw in the last weeks and months (Delfi, Haker.lv, kripto.media, kursors.lv, labsoflatvia.lv to name just a few). This year the hackathon was held in Latvia and brought together around 500 developers, designers and entrepreneurs to create and innovate, solving this year’s 5 challenges:

  • 🏆 ATEA challenge: Minimise manual work and drive data-powered decision-making
  • 🏆 Emergn challenge: Improve the quality of life for people with disabilities
  • 🏆 UI.COM & Riga TechGirls challenge: Help shoppers make more sustainable purchasing decisions 
  • 🏆 Game Changer Audio (GCA) challenge: Identify each individual note by listening to notes being played real-time
  • 🏆 Ministry of Education and Science challenge: Help make education hackable again!

For me, in turn, yet another audience, yet another experience.

In short, in this Star Wars-style presentation (yes, I am a fan, and given the number of DQ memes in this style, I am not an exception, so I would not call myself a geek or a weird person, but rather a normal DQ/IT person), I urged the audience to “help R2D2 save the galaxy!”.

Images from: History in Objects: Death Star Plans Datacard • Lucasfilm, Video Analysis of an Exploding Death Star | WIRED, Post | LinkedIn, Destruction of Despayre | Wookieepedia | Fandom. Special thanks to George Firican for the idea and inspiration!

In a bit more detail, I elaborated on the importance and relevance of data quality regardless of the age of this topic [which is older than me], on data quality management (DQM), and on the factors the DQM approach depends on. The popularity and importance of the topic are undoubtedly due to the amount of data we are dealing with and the fact that we are living in a data-driven world, where data are everywhere: they are generated continuously, by multiple sources, which is not only about our devices or sensors, but also about ourselves (however, with the help of the two above). This led to data being proclaimed, some time ago, the new oil. Have you heard this? I am sure you have. But have you thought about this statement? Is it true? False? Something in between? Bingo! While there are commonalities between data and oil, they are rather few. One interesting read devoted to this comes from Forbes. They admit that both artifacts, oil and data, can be seen as similar since both are “power”, including power for those who own them. In other words, they compare data owners such as Alibaba, Google, Twitter, Facebook etc. to the oil barons of 100 years ago. But, otherwise, a more in-depth comparative analysis reveals mostly differences. To name just a few:

💡 oil is a finite resource, while data are not. Instead, data are effectively infinitely durable and reusable, and treating them like oil, i.e. storing them in silos, reduces their value, usefulness and potential as a whole;

💡 another difference is in transportation: oil requires huge amounts of resources to be transported to where and when it is needed, while data can be replicated indefinitely and moved around the world at very high speeds and, more importantly, at very low cost;

💡 yet another difference lies in the usability of both, oil and data, once they have already been used. When oil is used, its energy is lost (as heat or light) or permanently converted into another form such as plastic. The usefulness of data, in contrast, tends to increase with their actual usage, i.e. new uses arise, data are eventually turned into training data etc.;

💡 as the world’s oil reserves dwindle, extracting oil becomes increasingly difficult and expensive, while data are becoming increasingly available, thanks, among other things, to technological advances and to the large number of data producers;

💡 and last but not least, oil drilling involves damage to the natural environment and exploitation of finite natural resources, while data mining does not, or at least it causes no intrinsic damage to the environment and involves no exploitation of finite natural resources. Of course, here we do not mention (but should not forget about) the electricity used to run the systems and the relatively low uptake of green computing (aka sustainable computing) for further data processing.

Thus, as Forbes suggests, if we want to talk about data as a power source or fuel, it makes much more sense to compare them with renewable sources 🌎🌎🌎 such as the sun ☀️, wind 💨 and tides 🌊. All in all, data can be seen as more than oil. Hence the popularity and importance of the data quality topic.

The factors that can affect the DQM approach, in turn, can differ, starting with those stemming from the relative nature of data quality as a phenomenon, i.e., its definition, the variety (and ambiguity) of data quality dimensions, for which data quality metrics are expected to be selected, DQ dynamism, dependence on the user and use-case etc. (some of the above are discussed in Towards a data quality framework for EOSC and “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment”), as well as the data artifact whose quality is under analysis. In other words, is this about a data object or dataset? A database? A data repository? An information system?

If it is a data object, the next “level” of factors is the data owner, known or unknown (third-party data such as open data), and the structure of the data: structured, semi-structured, or unstructured?

As for Information Systems / Software, I find that “think data quality first” and “data quality by design” are two mantras to be kept in mind. The latter is something we have studied together with my colleagues from Mexico, coming up with this modification of the “quality by design” principle into “data quality by design”. I reported on the respective study before: “ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards data quality by design” (read here), where we proposed DAQUAVORD, a Methodology for Project Management of Data Quality Requirements Specification, which is based on the Viewpoint-Oriented Requirements Definition (VORD) method and the latest and most generally accepted ISO/IEC 25012 standard. Its main idea is to start thinking about data quality as soon as the development of the system starts, to make sure that some level of data quality is ensured by design, i.e. transformed into both functional and non-functional requirements.

Alternatively, this can be done not necessarily before, but also during development or even when the system is already in production. Some solutions exist here, but I typically use the opportunity to self-advertise previous projects and studies that I worked on, especially this one, since it is based on the results of my PhD thesis, summarized in “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment”: the Data Quality Model-based testing approach (DQMBT) for testing information systems, which uses the data object-driven data quality model as a testing model and was presented in the context of an e-scooter system and an insurance system. Both, however, are rather ad-hoc approaches, whose main value lies in the conceptual idea, not the implementation, at least at this point.

For the repository, in turn, is it a data warehouse or a data lake? Or maybe even a data lakehouse? For the latter two, metadata and data governance become a “must” to avoid GIGO (the garbage in, garbage out effect) and turning the data lake into a data swamp, which is touched on in “Combining data lake and data wrangling for ensuring data quality in CRIS”, including why data wrangling should be given preference over data cleaning.

The importance of both metadata and data governance was then emphasized, and for the latter I asked Elon Musk for support 😀 More precisely, I mentioned him to back up my claims about the importance of data governance, which he once named as a key to improving the product you are delivering. I just wanted to make my words a bit more authoritative, i.e. he is seen as a more or less successful businessman, isn’t he? 😀

You can find slides here or watch the video 👇

Big thanks to both the organizers, Helve, and the supporters, who made both the hackathon and the forum a success. More precisely, Techchill, techhub, Lift 99, #RigaTechGirls, justjoin.it, Oradea.Tech.Hub, RTU design Factory, Startup Lithuania, Kaunas Technology University, Startup Estonia, Spring Hub, kood / Johvi, Technopol, Enterprise Forum CEE, Slush, Aaltos, AWS (Amazon Web Services), Google for Startups, Junction, Bird Incubator, EdTech Estonia, Sphere.it, Codecamp, Nine brains, Draper Startup House, Eiropas Digitālās inovācijas centrs, 28Stone.

And some more very special actors of the community, who were at the core of this hackathon edition: Emergn, Izglītības un zinātnes ministrija (Ministry of Education and Science), EPAM Systems Latvia, Atea Global Services Ltd., Ubiquiti Inc. & RigaTechGirls, Investment and Development Agency of Latvia (LIAA).

💬💬💬 Contributed talk for QWorld Quantum Science Days 2023 (QSD 2023)

In the very last days of May 2023, I had yet another experience: I delivered a contributed talk at QWorld Quantum Science Days 2023 (QSD 2023) titled “Framework for understanding quantum computing use cases from a multidisciplinary perspective and future research directions” (Ukpabi, D.C., Karjaluoto, H., Botticher, A., Nikiforova, A., Petrescu, D.I., Schindler, P., Valtenbergs, V., Lehmann, L., & Yakaryılmaz, A.). The talk is, in fact, based on a paper we made publicly available some time ago and developed even earlier, when, together with partners from Germany, Spain, Finland, Romania, and Latvia, we built a consortium and submitted a project proposal to the CHANSE call “Transformations: Social and Cultural Dynamics in the Digital Age”. We went far beyond my expectations there: we were notified that we would not be granted funding only at the very last stage, having gone through all the intermediate evaluation rounds, each of which was already fascinating news (at least for me). While working on the proposal and building our network, we conducted a preliminary analysis of the area, which, regardless of the outcome of the application, we decided to continue and bring to at least some logical end. We liked our result, so we decided to make it publicly available. And now, a few years later, we submitted our work to QSD 2023 and it was accepted. It was a big surprise, and I, as the person delegated by our team to present our study, delivered this talk, where I finally familiarized the audience with our findings. Imagine my surprise when the keynote that immediately preceded my talk, “Let’s talk about Quantum; Societal readiness through science communication research”, delivered on behalf of Quantum DELTA NL by Julia Cramer, went in a very similar direction!
It is also worth mentioning a very interesting coincidence: while the keynote elaborated on the DELTA that stands for five major quantum hubs, namely Delft, Eindhoven, Leiden, Twente, Amsterdam, I was making the final preparations for my presentation in the Delta building, which is the name of the building my office is located in. In both cases, no connection with COVID-19 😀

🤔 What is the paper about?

There has been increasing awareness of the tremendous opportunities inherent in quantum computing. It is expected that the speed and efficiency of quantum computing will significantly impact the Internet of Things, cryptography, finance, and marketing. Accordingly, there has been increased quantum computing research funding from national and regional governments and private firms. However, ❗❗❗ critical concerns regarding legal, political, and business-related policies germane to quantum computing adoption exist ❗❗❗

Since this is an emerging and highly technical domain, most of the existing studies focus heavily on the technical aspects of quantum computing. In contrast, our study highlights its practical and social use cases, which are needed for the increased interest of governments. More specifically, our study offers a multidisciplinary review of quantum computing, drawing on the expertise of scholars from a wide range of disciplines whose insights coalesce into a framework that simplifies the understanding of quantum computing, identifies possible areas of market disruption and offers empirically based recommendations that are critical for forecasting, planning, and strategically positioning QCs for accelerated diffusion.

To this end, we conducted a gray literature review, whose outputs were then structured in accordance with the framework of Dwivedi et al., 2021 (Dwivedi et al. (2021). Setting the future of digital and social media marketing research: Perspectives and research propositions. International Journal of Information Management, 59, 102168), which embodies three broad areas (environment, users, and application areas) and the dominant sub-themes presented in the figure below. We found that, for application areas, business and finance, renewable energy, medicine & pharmaceuticals, and manufacturing are now the hottest. For environment, we found subdomains such as ecosystem, security, jurisprudence, institutional change & geopolitics. And for users, nothing surprising: as usual, customers, firms, countries. We then dive into each of those areas, and later identify the most popular, the most promising, and the overlooked topics.

Sounds interesting? Read the paper here, find slides here, watch video here.

Quantum Science Days is an annual, international, and virtual scientific conference organized by QWorld (Association) to provide opportunities to the quantum community to present and discuss their research results at all levels (from short projects to thesis work to research publications), and to get to know each other. The third edition (QSD2023) included 7 invited speakers, 10 thematic talks on “Building an Open Quantum Ecosystem”, 31 contributed talks, an industrial demo session by Classiq, and a career talk on quantum. QSD2023 was sponsored by Unitary Fund & Classiq and supported by Latvian Quantum Initiative.


HackCodeX and my role as keynote speaker with “Data Quality as a prerequisite for business success: when should I start taking care of it?”

On June 3-5, 2023, the HackCodeX hackathon takes place: a weekend-long experience gathering tech enthusiasts from all over Europe to develop new ideas with the latest technology, in a unique environment and atmosphere that serves as a meeting place for like-minded developers, designers, and business professionals eager to explore new paths. It will be accompanied by a one-day tech industry conference gathering international and local experts to discuss the key trends in technology and computing, where I am one of the keynote speakers, invited to deliver a talk devoted to Data Quality that I titled “Data Quality as a prerequisite for business success: when should I start taking care of it?”

A full weekend of hacking and catching up with more than 400 other tech heads in an atmosphere and spirit like no other.

Sounds interesting? 🗓️ 🗓️ 🗓️ Save the date! June 5, 2023: talks focused on key topics such as emerging technologies, security and privacy, hardware and infrastructure, data quality, quantum technologies, as well as language and framework updates. From the growing interest in Artificial Intelligence to practical tips for securing software applications and delving into performance optimization, HackCodeX aims to inspire and provide new knowledge for tech experts of all levels.

Stay tuned and do not miss the opportunity to attend and meet me there! Read more here -> https://www.hackcodex.eu/forum

Rii Forum 2023 “Innovation 5.0: Navigating shocks and crises in uncertain times Technology-Business-Society” & a plenary debate “Advances in ICT & the Society”

Last week, I had an unforgettable experience at the Research and Innovation Forum (RiiForum), on which I posted previously, in Krakow, Poland, serving as plenary speaker and session chair. It was another great experience to take part in an absolutely amazing plenary session titled “Advances in ICT & the Society: threading the thin line between progress, development and mental health”, where we (Prof. Dr. Yves Wautelet, Prof. Dr. Marek Krzystanek, Karolina Laurentowska & Prof. Marek Pawlicki) discussed disruptive technologies in our professional lives in the past years, how they affected us and our colleagues, how they affect(ed) society and its specific groups, including their mental health, and the general perception of technology, i.e. whether it is an enemy of humanity or rather a friend and support, and how to make sure it is the latter. From this we developed a discussion around AI, ChatGPT, Metaverse, blockchain, even slightly touching on quantum computing. Of course, all this was placed in the context of democracy and freedoms / liberties. All in all, we approached the topic of governance and policy-making, which is too often reactive rather than proactive, which, in turn, leads to many negative consequences, and we also elaborated on engineering practices.

To sum up: emerging and disruptive technologies, Blockchain, AI, Metaverse, digital competencies, education, liberty, democracy, openness, engagement, inclusivity, Industry 5.0, Society 5.0. This is not a list of buzzwords, but a list of topics that we, both the plenary speakers and the audience, managed to cover and continued to talk about during the whole conference. Rich enough, isn’t it?

And the day did not end there, continuing with several super insightful sessions, of which, of course, the one I enjoyed most was the one I chaired. Three high-quality talks, each followed by a rich discussion thanks to an excellent audience, despite the fact that this was the last session of the day (before the dinner), namely:

  • Privacy in smart cities using VOSviewer: a bibliometric analysis by Xhimi Hysa, Gianluca Maria Guazzo, Vilma Cekani, Pierangelo Rosati
  • Public policy of innovation in China by Krzysztof Karwowski, Anna Visvizi
  • How Human-Centric solutions and Artificial Intelligence meet smart cities in Industry 5.0 by Tamai Ramirez, Sandra Amador, Antonio Macia-Lillo, Higinio Mora
     

And last but not least, Krakow surprised me a lot (in a positive sense, of course). It was my first time in Poland, and I am absolutely glad that it was in such a beautiful city as Krakow, a place with rich history and culture! Thank you, dear RiiForum2023 organizers, Anna Visvizi, Vincenzo Corvello, Orlando Troisi, Mara Grimaldi, Giovanni Baldi and everyone who was involved. It is always a pleasure to be a part of this community!