📢🚨⚠️Paper alert! Overlooked aspects of data governance: workflow framework for enterprise data deduplication

This time I would like to recommend a new paper, “Overlooked aspects of data governance: workflow framework for enterprise data deduplication”, which has just been presented at the IEEE-sponsored International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). This “just”, btw, means June 19, the day after my birthday, so I decided to start my new year with one more conference and paper. And yes, this means that what many of those who congratulated me were wishing me (to find time for myself, reach work-life balance etc.) is still something I have to try to achieve, since this time I again gave preference to my career over my personal life (what a surprise, isn’t it?) 🙂 Moreover, this is a conference where I am also part of the Steering Committee and the Technical Program Committee, and serve as publicity chair. During the conference, I also acted as session chair of its first session, which I consider a special honor. For me the session was very smooth, interactive and insightful, thanks, of course, to its participants and authors and their studies, which allowed us to establish a fruitful discussion and get some insights for our further studies (yes, I also got one very useful idea for further investigation). Thank you to all contributors, with special thanks to Francisco Bonilla Rivas, Bruck Wubete, Reem Nassar, Haitham Al Ajmi.

I am also proud that we secured as one of the four keynote speakers Prof. Eirini Ntoutsi from the Bundeswehr University Munich (UniBw-M), Germany, who delivered the keynote “Bias and Discrimination in AI Systems: From Single-Identity Dimensions to Multi-Discrimination”. I had heard it at one of the previous conferences I attended and decided that it was a “must” for our conference as well, so I am super glad that Eirini accepted our invitation! I should immediately add that the other keynotes were excellent as well: Giancarlo Fortino (University of Calabria, Italy), Dofe Jaya (Computer Engineering Department, California State University, Fullerton, California, USA), and Sandra Sendra (Polytechnic University of Valencia, Spain).

The paper I presented was written by a team of three: Otmane Azeroual (German Centre for Higher Education Research and Science Studies (DZHW), Germany), myself, Anastasija Nikiforova (Faculty of Science and Technology, Institute of Computer Science, University of Tartu, Estonia & Task Force “FAIR Metrics and Data Quality”, European Open Science Cloud), and Kewei Sha (College of Science and Engineering, University of Houston Clear Lake, USA). A very international team.

So, what is the paper about? It is (or should be) clear that data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), product management, manufacturing, and marketing are used, duplicates occur, e.g., multiple entries for the same customer or product in a database or information system. There can be several reasons for this, incl. but not limited to the growing volume of data, driven by the adoption of cloud technologies, the use of multiple different sources, and the proliferation of connected personal and work devices in homes, stores, offices and supply chains. The result of non-unique or duplicate records, however, is degraded data quality, which ultimately leads to inaccurate analysis, poor or skewed decisions, and distorted insights provided by Business Intelligence (BI) or machine learning (ML) algorithms, models, forecasts, and simulations that take the data as input. It also affects other data-driven activities such as service personalisation (in terms of accuracy, trustworthiness and reliability), user acceptance / adoption and satisfaction, customer service, risk management, crisis management, and resource management (time, human, and fiscal), not to mention wasted resources and employees who are less likely to trust the data and associated applications, which in turn affects the company image. This can lead to the failure of a project, if not of a business.

At the same time, the amount of data that companies collect is growing exponentially, making it difficult to manage effectively. Thus, both ex-ante and ex-post deduplication mechanisms are critical in this context to ensure sufficient data quality, and they are usually integrated into a broader data governance approach. In this paper, we develop such a conceptual data governance framework for effective and efficient management of duplicate data and for improvement of data accuracy and consistency in medium to large data ecosystems. We present methods and recommendations for companies to deal with duplicate data in a meaningful way, and the presented framework is integrated into one of the most popular data quality tools, DataCleaner.

In short, in this paper we:

  • first, present methods for how companies can deal meaningfully with duplicate data. Initially, we focus on data profiling using several analysis methods applicable to different types of datasets, incl. analysis of different types of errors, structuring, harmonizing, & merging of duplicate data;
  • second, we propose methods for reducing the number of comparisons and matching attribute values based on similarity (in medium to large databases). The focus is on easy integration and duplicate detection configuration so that the solution can be easily adapted to different users in companies without domain knowledge. These methods are domain-independent and can be transferred to other application contexts to evaluate the quality, structure, and content of duplicate / repetitive data;
  • finally, we integrate the chosen methods into the framework of Hildebrandt et al. [ref 2]. We also explore some of the most common data quality tools in practice, into which we integrate this framework.
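
To make the second point a bit more tangible, here is a minimal sketch of the general idea of reducing the number of comparisons ("blocking") and then matching attribute values by similarity. This is not the implementation from the paper or from any data quality tool: the sample records, the choice of the city as blocking key, the use of Python's standard `difflib` for similarity, and the 0.85 threshold are all assumptions made purely for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records, e.g. merged from a CRM and an ERP export.
RECORDS = [
    {"id": 1, "name": "Anna Schmidt", "city": "Berlin"},
    {"id": 2, "name": "Ana Schmidt", "city": "Berlin"},
    {"id": 3, "name": "Peter Meier", "city": "Hamburg"},
    {"id": 4, "name": "Anna Schmitt", "city": "Munich"},
]


def blocking_key(record):
    """Illustrative blocking key: the normalised city name."""
    return record["city"].strip().lower()


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, threshold=0.85):
    """Return (id, id, score) for likely duplicates within each block."""
    # Step 1: group records into blocks so that only records sharing
    # a key are compared, instead of comparing every pair.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # Step 2: compare names pairwise inside each block only.
    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


DUPLICATES = find_duplicates(RECORDS)
for left_id, right_id, score in DUPLICATES:
    print(f"possible duplicate: {left_id} ~ {right_id} (similarity {score})")
```

With these sample records only the two Berlin entries are compared and flagged. Record 4 sits in a different block and is never compared with record 1, which shows the trade-off of blocking: fewer comparisons, at the risk of missing duplicates whose blocking key differs.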

After that, we test and validate the framework. The final refined solution provides the basis for subsequent use. It consists of detecting and visualizing duplicates, presenting the identified redundancies to the user in a user-friendly manner to enable and facilitate their further elimination.

With this paper we aim to support research in data management and data governance by identifying duplicate data at the enterprise level and meeting today’s demands for increased connectivity / interconnectedness, data ubiquity, and multi-data sourcing. In addition, the proposed conceptual data governance framework aims to provide an overview of data quality, accuracy and consistency to help practitioners approach data governance in a structured manner.

In general, what is needed is not only technological solutions that identify / detect poor-quality data and allow their examination and correction, or that prevent them by integrating controls into the system design, striving for “data quality by design” [ref3, ref4], but also cultural changes related to data management and governance within the organization. These two perspectives form the basis of a healthy business data ecosystem. Thus, the presented framework describes the hierarchy of people who are allowed to view and share data, rules for data collection, data privacy, data security standards, and the channels through which data can be collected. Ultimately, this framework will help users be more consistent in data collection and data quality, for reliable and accurate results of data-driven actions and activities.

Sounds interesting? Read the paper -> here (to be cited as: Azeroual, O., Nikiforova, A., Sha, K. (2023, June). Overlooked aspects of data governance: workflow framework for enterprise data deduplication. In 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). IEEE (in print))

The International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023) is co-located with the International Conference on Multimedia Computing, Networking and Applications (MCNA2023); both are sponsored by IEEE (IEEE Espana Seccion), Universitat Politècnica de València, and Al Ain University. Great thanks to the organizers: Jaime Lloret, Universitat Politècnica de València, Spain & Yaser Jararweh, Jordan University of Science and Technology, Jordan & Marios C. Angelides, Brunel University London, UK & Muhannad Quwaider, Jordan University of Science and Technology, Jordan.

References:

Azeroual, O., Nikiforova, A., Sha, K. (2023, June). Overlooked aspects of data governance: workflow framework for enterprise data deduplication. In 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023). IEEE (in print).

Hildebrandt, K., Panse, F., Wilcke, N., & Ritter, N. (2017). Large-scale data pollution with Apache Spark. IEEE Transactions on Big Data, 6(2), 396-411

Guerra-García, C., Nikiforova, A., Jiménez, S., Perez-Gonzalez, H. G., Ramírez-Torres, M., & Ontañon-García, L. (2023). ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards Data Quality by Design. Data & Knowledge Engineering, 145, 102152.

Corrales, D. C., Ledezma, A., & Corrales, J. C. (2016). A systematic review of data quality issues in knowledge discovery tasks. Revista Ingenierías Universidad de Medellín, 15(28), 125-150.

HackCodeX Forum Keynote “Data Quality as a prerequisite for your business success: when should I start taking care of it?”

On June 5, I was delighted to be invited as a keynote speaker at the HackCodeX Forum, delivering a keynote titled “Data Quality as a prerequisite for your business success: when should I start taking care of it?” in my hometown, Riga, Latvia. HackCodeX Forum is a one-day event where international experts share their experience and knowledge about emerging technologies and areas such as Artificial Intelligence, Security, Data Quality, Quantum Computing, Sustainability, Open Data, Privacy, Ethics, Digital Services (with a keynote from the CEO of SK ID Solutions, one of the solutions that make Estonia the #1 digital nation) etc. This time I was invited to cover the topic of Data Quality and I was happy to do so, especially considering that the HackCodeX Forum closes one of the leading hackathons in Europe. Riga was fascinated by and passionate about it, as evidenced by the rich list of advertisements we all saw in the last weeks and months (Delfi, Haker.lv, kripto.media, kursors.lv, labsoflatvia.lv to name just a few). This year the hackathon was held in Latvia and brought together around 500 developers, designers and entrepreneurs to create and innovate, solving this year’s 5 challenges:

  • 🏆 ATEA challenge: Minimise manual work and drive data-powered decision-making
  • 🏆 Emergn challenge: Improve the quality of life for people with disabilities
  • 🏆 UI.COM & Riga TechGirls challenge: Help shoppers make more sustainable purchasing decisions 
  • 🏆 Game Changer Audio (GCA) challenge: Identify each individual note by listening to notes being played real-time
  • 🏆 Ministry of Education and Science challenge: Help make education hackable again!

For me, in turn, it was yet another audience and yet another experience.

In short, in this Star Wars-style presentation (yes, I am a fan, and given the number of DQ memes in this style, I am no exception, so I would not call myself a geek or a weird person, but rather a normal DQ/IT person), I urged the audience to “help R2D2 save the galaxy!”.

Images from: History in Objects: Death Star Plans Datacard • Lucasfilm, Video Analysis of an Exploding Death Star | WIRED, Post | LinkedIn, Destruction of Despayre | Wookieepedia | Fandom. Special thanks to George Firican for the idea and inspiration!

In a bit more detail, I elaborated on the importance and relevance of data quality regardless of the age of this topic [which is older than me], on data quality management (DQM), and on the factors the DQM approach depends on. The popularity and importance of the topic are undoubtedly due to the amount of data we are dealing with and the fact that we live in a data-driven world, where data are everywhere: they are generated continuously by multiple sources, which is not only about our devices or sensors, but also about ourselves (albeit with the help of the former two). This led to data being claimed, some time ago, to be the new oil. Have you heard this? I am sure you have. But have you thought about this statement? Is it true? False? Something in between? Bingo! While there are commonalities between data and oil, they are rather few. One interesting read devoted to this comes from Forbes. They admit that both artifacts, oil and data, can be seen as similar since both are “power”, including the power of those who own them. In other words, they compare data owners such as Alibaba, Google, Twitter, Facebook etc. to the oil barons of 100 years ago. Otherwise, however, a more in-depth comparative analysis reveals mostly differences. To name just a few:

💡 oil is a finite resource, while data are not. Instead, data are effectively infinitely durable and reusable, and treating them like oil, i.e. storing them in silos, reduces their value, usefulness and potential as a whole;

💡 another difference is in transportation: oil requires huge amounts of resources to be transported to where and when it is needed, while data can be replicated indefinitely and moved around the world at very high speeds and, more importantly, at very low cost;

💡 yet another difference lies in the usability of both, oil and data, once they have been used. When oil is used, its energy is lost (as heat or light) or permanently converted into another form such as plastic. The usefulness of data, in contrast, tends to increase with their actual usage, i.e. new uses arise, data are eventually turned into training data etc.;

💡 as the world’s oil reserves dwindle, extracting oil becomes increasingly difficult and expensive, while data are becoming increasingly available, incl. but not limited to due to technological advances and the large number of data producers;

💡 and last but not least, oil drilling involves damage to the natural environment and exploitation of finite natural resources, while data mining doesn’t, or at least does not do so intrinsically. Of course, here we do not mention (but should not forget about) the electricity used to run the systems and the relatively low uptake of green computing (aka sustainable computing) for further data processing.

Thus, as Forbes suggests, if we want to talk about data as a power source or fuel, it makes much more sense to compare them with renewable sources 🌎🌎🌎 such as the sun ☀️, wind 💨 and tides 🌊. All in all, data can be seen as more than oil. Hence the popularity and importance of the data quality topic.

The factors that can affect the DQM approach, in turn, vary. They start with those stemming from the relative nature of data quality as a phenomenon, i.e., its definition, the variety (and ambiguity) of data quality dimensions for which data quality metrics are expected to be selected, DQ dynamism, dependence on the user and use-case etc. (some of these are discussed in Towards a data quality framework for EOSC and “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment”), and continue with the data artifact whose quality is under analysis. In other words, is this about a data object or dataset? Database? Data repository? Information system?

If it is a data object, the next “level” of factors is the data owner, known or unknown (third-party data such as open data), and the data structure: structured, semi-structured, or unstructured?

As for Information Systems / Software, I find that “think data quality first” and “data quality by design” are two mantras to keep in mind. The latter is something I have studied together with my colleagues from Mexico, turning the “quality by design” principle into “data quality by design”. I reported on the respective study before, “ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards data quality by design” (read here), where we proposed DAQUAVORD, a Methodology for Project Management of Data Quality Requirements Specification, which is based on the Viewpoint-Oriented Requirements Definition (VORD) method and the latest and most generally accepted ISO/IEC 25012 standard. Its main idea is to start thinking about data quality as soon as the development of the system starts, to make sure that a certain data quality level is ensured by design, i.e. transformed into both functional and non-functional requirements.

Alternatively, data quality can be addressed not necessarily before, but also during development or even when the system is already in production. Some solutions exist here, but I typically use the opportunity to self-advertise previous projects and studies that I worked on, especially this one, since it is based on the results of my PhD thesis, summarized in “Definition and Evaluation of Data Quality: a user-oriented data object-driven approach to data quality assessment”. Namely, it is the Data Quality Model-based Testing (DQMBT) approach for testing information systems, which uses the data object-driven data quality model as a testing model and was presented in the context of an e-scooter system and an insurance system. Both, however, are rather ad-hoc approaches, whose main value lies in the conceptual idea, not the implementation, at least at this point.

For the repository, in turn, the question is whether it is a data warehouse or a data lake, or maybe even a data lakehouse. For the latter two, metadata and data governance become a “must” to avoid GIGO (the garbage in, garbage out effect) and the data lake turning into a data swamp. This is briefly addressed in “Combining data lake and data wrangling for ensuring data quality in CRIS”, which, among other things, elaborates on why data wrangling should be given preference over data cleaning.

The importance of both metadata and data governance was then emphasized, and for the latter I asked for support from Elon Musk 😀 More precisely, I mentioned him to back up my claims about the importance of data governance, which he once named as a key to improving the product you are delivering. I just wanted to make my words a bit more authoritative, i.e. he is seen as a more or less successful businessman, isn’t he? 😀

You can find slides here or watch the video 👇

Big thanks to both the organizers, Helve, and the supporters who made both the hackathon and the forum a success. More precisely, Techchill, techhub, Lift 99, #RigaTechGirls, justjoin.it, Oradea.Tech.Hub, RTU design Factory, Startup Lithuania, Kaunas Technology University, Startup Estonia, Spring Hub, kood / Johvi, Technopol, Enterprise Forum CEE, Slush, Aaltos, AWS (Amazon Web Services), Google for Startups, Junction, Bird Incubator, EdTech Estonia, Sphere.it, Codecamp, Nine brains, Draper Startup House, Eiropas Digitālās inovācijas centrs, 28Stone.

And some more very special actors of the community, who were at the core of this hackathon edition: Emergn, Izglītības un zinātnes ministrija (Ministry of Education and Science), EPAM Systems Latvia, Atea Global Services Ltd., Ubiquiti Inc. & RigaTechGirls, Investment and Development Agency of Latvia (LIAA).

CFP: The International Conference on Intelligent Data Science Technologies and Applications (IDSTA2023)

On behalf of the organizers and as publicity chair, I sincerely invite you to consider submitting the results of your recent research to The International Conference on Intelligent Data Science Technologies and Applications (IDSTA2023), which will be held in conjunction with the Kuwait Fintech and Blockchain Summit.

A huge amount of data is being generated and transmitted every day. To be able to deal with this data, extract useful information from it, store it, transmit it, and represent it, intelligent technologies and applications are needed. The International Conference on Intelligent Data Science Technologies and Applications (IDSTA) is a peer-reviewed conference whose objective is to advance the Data Science field by giving researchers, engineers, and practitioners an opportunity to present their latest findings in the field. It will also invite key persons in the field to share their current knowledge and their future expectations for the field. Topics of interest for submission include, but are not limited to:

💡Applied Public Affairs, incl. but not limited to Campaign Management, Mass Communication Politics, Political Analysis, Survey Sampling
💡Business Analytics, incl. but not limited to Stock Market Analysis, Predictive Analytics, Business Intelligence
💡Finance, incl. but not limited to Risk Management, Algorithmic Trading, Fraud Detection, Financial Analysis
💡Computer Science, incl. but not limited to Database Management Systems, Scientific Computing, Computer Vision, Fuzzy Computing, Feature Selection, Neural Networks, Deep Learning, Meta-Learning, Process Mining, Artificial Intelligence, Data Mining, Big Data, Web Analytics, Text Mining, Natural Language Processing, Sentiment Analysis, Social Media Analysis, Data Fusion, Performance Analysis and Evaluation, Evolutionary Computing and Optimization, Hybrid Methods, Granular Computing, Recommender Systems, Data Visualization, Predictive Maintenance, Internet of Things (IoT), Web Scraping
💡Sustainability, incl. but not limited to Datasets on Sustainability, Sustainability Modeling, Energy Sustainability, Water Sustainability, Environmental Sustainability, Risk Analysis
💡Cybersecurity, incl. but not limited to Data Privacy and Security, Network Security, Communication Security, Cryptography, Fraud Detection, Blockchain
💡Environmental Science, incl. but not limited to GIS, Climatographic, Remote Sensing, Spatial Data Analysis, Weather Prediction and Tracking
💡Biotechnologies, incl. but not limited to Genome Analysis, Drug Discovery and Screening and Side Effect Analysis, Structural and Folding Pattern, Disease Discovery and Classification, Bioinformatics, Next-Gen Sequencing
💡Smart City, incl. but not limited to City Data Management, Smart Traffic, Surveillance, Location-Based Services, Robotics
💡Human Behaviour Understanding
💡Semi-Structured and Unstructured Data
💡Pattern Recognition
💡Transparency in Research Data
💡Data and Information Quality
💡GPU Computing
💡Crowdsourcing


🗓️🗓️🗓️ IMPORTANT DATES

  • Paper submission: March 15, 2023
  • Acceptance notification: May 20, 2023
  • Full paper camera-ready submission: October 1, 2023
  • Conference dates: October 24-26, 2023

All papers that are accepted, registered, and presented in IDSTA2023 and the workshops co-located with it will be submitted to IEEEXplore for possible publication. 
For any inquiries, contact intelligenttechorg@gmail.com.

Submit the paper and meet our team in Kuwait in October, 2023!
 

With best wishes,

IDSTA2023 organizers