UT & Swedbank Data Science Seminar “When, Why and How? The Importance of Business Intelligence”

Last week I had the pleasure of taking part in a Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence”. In this seminar, organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank, we (me, Mohammad Gharib, Jurgen Koitsalu, and Igor Artemtsuk) discussed the importance of BI, with some focus on data quality. More precisely, two of the four talks were delivered by representatives of the University of Tartu and were more theoretical in nature – we both decided to focus on data quality (although for my talk it was not the main focus this time) – while the other two talks were delivered by representatives of Swedbank and mainly elaborated on BI: what it can give, what it already gives, how it is achieved, and much more. The talks were followed by a panel moderated by Prof. Marlon Dumas.

In a bit more detail… In my presentation, I talked about:

  • “Data warehouse vs. data lake – what are they and what is the difference between them?” – in a very few words: structured vs. unstructured, static vs. dynamic (real-time) data, schema-on-write vs. schema-on-read, ETL vs. ELT. With further elaboration on: What are their goals and purposes? What is their target audience? What are their pros and cons?
  • “Is the data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay with a data warehouse. But if you want predictive BI, or want to use your data for ML (or do not yet have a specific idea of how you want to use the data, but want to be able to explore it effectively and efficiently), then a data warehouse might not be the best option.
  • “So, the data lake will save me a lot of resources, because I do not have to worry about how to store/allocate the data – I just put it all in one storage and voilà?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
  • “But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management is the answer (but it is not as easy as it sounds – do not forget about your data engineers and be friendly with them [always… literally always :D]), and also think about the culture in your organization.
  • “So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
  • “Are data warehouses and data lakes the only options to consider, or are we missing something?” – indeed, we are missing something: the data lakehouse!
  • “If a data lakehouse combines the benefits of a data warehouse and a data lake, is it a silver bullet?” – no, it is not! It is another (relatively immature) option to consider that may be the best fit for you, but it is not a panacea. Dealing with data is (still) not easy…
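
To make the schema-on-write vs. schema-on-read distinction from the first bullet a bit more tangible, here is a minimal, purely illustrative Python sketch (all names and the toy schema are hypothetical, not from any particular product): a warehouse-style store validates and types records before storing them (ETL), while a lake-style store keeps everything raw and applies the schema only when the data is read (ELT).

```python
# Illustrative sketch: schema-on-write (warehouse/ETL) vs. schema-on-read (lake/ELT).

SCHEMA = {"id": int, "amount": float}  # hypothetical fixed schema


def load_schema_on_write(records):
    """Warehouse-style: transform & validate BEFORE storing; bad rows are rejected at load time."""
    table = []
    for r in records:
        try:
            table.append({col: cast(r[col]) for col, cast in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            pass  # row rejected on write
    return table


def read_schema_on_read(lake):
    """Lake-style: everything was stored as-is; the schema is applied only at query time."""
    for r in lake:
        try:
            yield {col: cast(r[col]) for col, cast in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # malformed rows only surface (and are skipped) when read


raw = [{"id": "1", "amount": "9.5"},   # valid
       {"id": "2"},                    # missing field
       {"id": "3", "amount": "oops"}]  # wrong type

warehouse = load_schema_on_write(raw)        # only valid, typed rows are stored
lake = list(raw)                             # everything is stored untouched
queried = list(read_schema_on_read(lake))    # schema applied on read
```

The point of the sketch is the trade-off discussed above: the warehouse pays the validation cost (and loses flexibility) up front, while the lake defers it to every reader – which is exactly why, without governance, a lake drifts toward a swamp.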

In addition, in this talk I briefly introduced our ongoing research into integrating the data lake as a data repository with data wrangling, aiming at increased data quality in information systems. In short, the result is somewhat like an improved data lakehouse, where we emphasize that data governance and data wrangling need to be integrated to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse, while not a brand-new concept, is still much debated, including, but not limited to, its very definition).

In turn, my colleague Mohammad Gharib discussed what data quality and, more specifically, data quality requirements are and why they really matter, and provided a very interesting perspective on how to define high-quality data, which can further serve as the basis for defining these requirements.

All in all, although we did not know each other before and had only a very limited idea of what each of us would talk about, we all agreed that the seminar turned out to be very coherent, with our talks complementing each other and extending points that had been touched upon but not thoroughly elaborated. This made the seminar not only a success but also the starting point of a very lively discussion (although, as usually happens, the greater part of this discussion took place during the coffee break and is therefore, unfortunately, not available in the recordings, the link to which is available below).

The recordings are available here.

Editorial Board Member of Data & Policy (Cambridge University Press)

Since July 2022, I have been serving, elected by the Syndicate of Cambridge University Press, as an Editorial Board Member of the Cambridge University Press journal Data & Policy. Data & Policy is a peer-reviewed, open-access venue dedicated to the potential of data science to address important policy challenges. For more information about the goal and vision of the journal, read the editorial “Data & Policy: A new venue to study and explore policy–data interaction” by Stefaan G. Verhulst, Zeynep Engin, and Jon Crowcroft. More precisely, I act as an Area Editor of the “Focus on Data-driven Transformations in Policy and Governance” area (with the proud short name “Area 1”). This area focuses on the high-level vision for the philosophy, ideation, formulation, and implementation of new approaches leading to paradigm shifts, innovation, and efficiency gains in collective decision-making processes. Topics include, but are not limited to:

  • Data-driven innovation in public, private and voluntary sector governance and policy-making at all levels (international, national, and local): applications for real-time management, future planning, and rethinking/reframing governance and policy-making in the digital era;
  • Data and evidence-based policy-making;
  • Government-private sector-citizen interactions: data and digital power dynamics, asymmetry of information; democracy, public opinion and deliberation; citizen services;
  • Interactions between human, institutional and algorithmic decision-making processes, psychology and behaviour of decision-making;
  • Global policy-making: global existential debates on utilizing data-driven innovation with impact beyond individual institutions and states;
  • Socio-technical and cyber-physical systems, and their policy and governance implications.

The remaining areas more specifically represent the current applications, methodologies, and strategies that underpin the broad aims of Data & Policy’s vision: Area 2 “Data Technologies and Analytics for Policy and Governance”, Area 3 “Policy Frameworks, Governance and Management of Data-driven Innovations”, Area 4 “Ethics, Equity and Trust in Policy Data Interactions”, Area 5 “Algorithmic Governance”, and Area 6 “Data to Tackle Global Issues and Dynamic Societal Threats”.

Editorial committees of Data & Policy (Area 1)

We are interested in four types of submission:

  • Research articles that use rigorous methods to investigate how data science can inform or impact policy, for example by improving situation analysis, predictions, public service design, and/or the legitimacy and/or effectiveness of policy making. Published research articles are typically reviewed by three peer reviewers: two assessing the academic or methodological rigour of the paper, and one providing an interdisciplinary or policy-specific perspective. (Approx. 8,000 words in length).
  • Commentaries are shorter articles that discuss and/or problematize an issue relevant to the Data & Policy scope. Commentaries are typically reviewed by two peer reviewers. (Approx 4,000 words in length).
  • Translational articles are focused on the transfer of knowledge from research to practice and from practice to research. See our guide to writing translational papers. (Approx 6,000 words in length).
  • Replication studies examine previously published research, whether in Data & Policy or elsewhere, and report on an attempt to replicate findings.

Read more about Data & Policy and consider submitting your contribution!

Moreover, as a part of this journal, we (the Data & Policy community) organize the “Data for Policy: Ecosystems of innovation and virtual-physical interactions” conference in a hybrid physical-virtual format, with one-day, in-person events held in three regions: Asia (Hong Kong), America (Seattle), and Europe (Brussels). I sincerely recommend you consider it, and preferably attend! While this is already the seventh edition of the conference, this is my first year taking part in its organization, so I am especially excited about and invested in its success!

Data for Policy, Area Editors

In addition to its six established Standard Tracks, and reflecting its three-region model this year, the Data for Policy 2022 conference highlights “Ecosystems of innovation and virtual-physical interactions” as its theme. Distinct geopolitical and virtual-physical ecosystems are emerging as everyday operations and important socio-economic decisions are increasingly outsourced to digital systems. For example, the US’s open market approach empowering multinational digital corporations contrasts with greater central government control in the Chinese digital ecosystem, and radically differs from Europe’s priority on individual rights, personal privacy, and digital sovereignty. Other localised ecosystems are emerging around national priorities: India focuses on the domestic economy, and Russia prioritises public and national security. The Global South remains underrepresented in the global debate. The developmental trajectory of the different ecosystems will shape future governance models, democratic values, and the provision of citizen services. In an envisioned ‘metaverse’ future, boundaries between physical and virtual spaces will become even more blurred, further underlining the need to scrutinise and challenge the various systems of governance.

The Data for Policy conference series is the premier global forum for multidisciplinary and cross-sector discussions around the theories, applications, and implications of data science innovation in governance and the public sector. Its associated journal, Data & Policy, published by Cambridge University Press, has quickly established itself as a major venue for publishing research in the field of data-policy interactions. Data for Policy is a non-profit initiative, registered as a community interest company in the UK, supported by sustainer partners Cambridge University Press, the Alan Turing Institute, and the Office for National Statistics.

Read more about Data for Policy and become a part of it!