UT & Swedbank Data Science Seminar “When, Why and How? The Importance of Business Intelligence”

Last week I had the pleasure of taking part in a Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence”. In this seminar, organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank, we (me, Mohammad Gharib, Jurgen Koitsalu, and Igor Artemtsuk) discussed the importance of BI, with some focus on data quality. More precisely, two of the four talks were delivered by representatives of the University of Tartu and were more theoretical in nature – we both decided to focus on data quality (although for my talk this was not the main focus this time) – while the other two talks were delivered by representatives of Swedbank and mainly elaborated on BI: what it can give, what it already gives, how it is achieved, and much more. The talks were followed by a panel moderated by prof. Marlon Dumas.

In a bit more detail… In my presentation, I talked about:

  • “Data warehouse vs. data lake – what are they and what is the difference between them?” – in very few words: structured vs. unstructured, static vs. dynamic (real-time) data, schema-on-write vs. schema-on-read, ETL vs. ELT; with further elaboration on their goals and purposes, their target audiences, and their pros and cons.
  • “Is the data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. If descriptive BI is the only purpose, it might still be better to stay with a data warehouse. But if you want predictive BI, want to use your data for ML, or do not yet have a specific idea of how you will use the data but want to be able to explore it effectively and efficiently, a data warehouse might not be the best option.
  • “So, the data lake will save me a lot of resources, because I do not have to worry about how to store/allocate the data – I just put it in one storage and voila?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
  • “But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management is the answer (but it is not as easy as it sounds – do not forget about your data engineer and be friendly with him [always… literally always :D]), and also think about the culture in your organization.
  • “So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
  • “Are data warehouses and data lakes the only options to consider, or are we missing something?” – indeed, we are missing something: the data lakehouse!
  • “If a data lakehouse combines the benefits of a data warehouse and a data lake, is it a silver bullet?” – no, it is not! It is another (relatively immature) option to consider that may be the best fit for you, but it is not a panacea. Dealing with data is (still) not easy…
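To make the schema-on-write vs. schema-on-read (and ETL vs. ELT) contrast from the first question concrete, here is a minimal sketch in plain Python. The record fields and function names are invented purely for illustration; real pipelines would, of course, use proper storage engines:

```python
import json

# Schema-on-write (data warehouse / ETL style): validate and shape each
# record BEFORE it is stored, so every stored row matches the schema.
def etl_ingest(raw_records, table):
    for rec in raw_records:
        row = {
            "user_id": int(rec["user_id"]),          # types enforced up front
            "amount": float(rec.get("amount", 0.0)),
        }
        table.append(row)                            # only clean rows land

# Schema-on-read (data lake / ELT style): store the raw payload untouched,
# and apply a schema only at read time.
def elt_ingest(raw_records, lake):
    lake.extend(json.dumps(rec) for rec in raw_records)

def read_with_schema(lake):
    for blob in lake:
        rec = json.loads(blob)
        yield {                                      # interpretation happens here
            "user_id": int(rec["user_id"]),
            "amount": float(rec.get("amount", 0.0)),
        }

raw = [{"user_id": "7", "amount": "19.5", "note": "extra field survives in the lake"}]
warehouse, lake = [], []
etl_ingest(raw, warehouse)
elt_ingest(raw, lake)
assert warehouse[0] == {"user_id": 7, "amount": 19.5}
assert next(read_with_schema(lake))["amount"] == 19.5
```

Note how the lake keeps the extra `note` field that the warehouse row discards – exactly the flexibility (and the data-quality burden) discussed above.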

In addition, in this talk I briefly introduced our ongoing research on integrating the data lake as a data repository with data wrangling, seeking increased data quality in information systems. In short, this is somewhat like an improved data lakehouse, where we emphasize that data governance and data wrangling need to be integrated to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse, while not a brand-new concept, is still heavily debated, including, but not limited to, its very definition).
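The governance side of this idea – the thing that keeps a lake from becoming a swamp – can be illustrated with a toy sketch. This is not our research prototype or any real cataloguing tool; the metadata fields and rules are invented for illustration: a lake that simply refuses undocumented drops.

```python
from datetime import datetime, timezone

# Toy metadata catalogue: every dataset dropped into the lake must be
# registered with minimal governance metadata, otherwise it is rejected.
REQUIRED_META = {"owner", "source", "description"}

catalog = {}   # dataset name -> governance metadata
lake = {}      # dataset name -> raw payload

def ingest(name, payload, meta):
    missing = REQUIRED_META - meta.keys()
    if missing:
        raise ValueError(f"refusing to ingest {name!r}: missing metadata {sorted(missing)}")
    catalog[name] = {**meta, "ingested_at": datetime.now(timezone.utc).isoformat()}
    lake[name] = payload

# A documented dataset is accepted…
ingest("clickstream_2022_05", b"...raw events...",
       {"owner": "analytics", "source": "web frontend", "description": "raw click events"})

# …while an undocumented dump never lands in the lake.
try:
    ingest("mystery_dump", b"\x00\x01", {"owner": "unknown"})
except ValueError as err:
    print(err)
```

The point is not the code itself but the policy it encodes: wrangling and governance are enforced at ingestion time, not bolted on after the swamp has formed.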

My colleague Mohammad Gharib, in turn, discussed what data quality (DQ) and, more specifically, data quality requirements are and why they really matter, and provided a very interesting perspective on how to define high-quality data, which can further serve as the basis for defining these requirements.

All in all, although we did not know each other before and had a very limited idea of what each of us would talk about, we all admitted that the seminar turned out to be very coherent: we, and our talks respectively, complemented each other, extending points that had been touched upon but not thoroughly elaborated. This allowed us not only to make the seminar a success, but also to establish a very lively discussion (although, as usually happens, most of this discussion took place during the coffee break and is therefore, unfortunately, not available in the recordings linked below).

The recordings are available here.

Research and Innovation Forum 2022: panel organizer, speaker, PC member, moderator and Best panel moderator award

As I wrote earlier, this year I was invited to organize my own panel session within the Research and Innovation Forum (Rii Forum). This invitation was a follow-up to several articles that I have recently published (article#1, article#2, article#3) and a chapter, which I was developing at the time, to be published in “Big data & decision-making: how big data is relevant across fields and domains” (Emerald Studies in Politics and Technology). I was glad to accept this invitation, but I did not even think about how many roles I would take on at the Rii Forum and how many emotions I would experience. So, how was it?

First, what was my panel about? It was dedicated to data security, entitled “Security of data storage facilities: is your database sufficiently protected?”, and was part of the track “ICT, safety, and security in the digital age: bringing the human factor back into the analysis“.

My own talk was titled “Data security as a top priority in the digital world: preserve data value by being proactive and thinking security first“, making it part of the panel described above. In this talk I elaborated on the main idea of the panel, referring to a study I recently conducted. In short, today, in the age of information and Industry 4.0, billions of data sources, including but not limited to interconnected devices (sensors, monitoring devices) forming Cyber-Physical Systems (CPS) and the Internet of Things (IoT) ecosystem, continuously generate, collect, process, and exchange data. With the rapid increase in the number of devices and information systems in use, the amount of data is growing, and, due to digitization and the variety of data continuously produced and processed with reference to Big Data, so is its value. As a result, the risk of security breaches and data leaks grows as well. The value of data, however, depends on several factors, of which data quality, and data security – which can affect data quality if the data are accessed and corrupted – are the most vital. Data serve as the basis for decision-making and as input for models, forecasts, simulations, etc., which can be of high strategic and commercial / business value. This has become even more relevant during the COVID-19 pandemic, which, in addition to affecting the health, lives, and lifestyle of billions of citizens globally and making life even more digitized, has had a significant impact on business, especially because of the challenges companies have faced in maintaining business continuity in the so-called “new normal”. However, in addition to the cybersecurity threats caused by changes directly related to the pandemic and its consequences, many previously known threats have become even more desirable targets for intruders and hackers. Every year, millions of personal records become available online.
Moreover, the popularity of IoT search engines (IoTSE) has lowered the complexity of searching for connected devices on the internet, giving easy access even to novices thanks to widespread step-by-step guides on how to use an IoT search engine to find, and gain access to if insufficiently protected, webcams, routers, databases, and other artifacts. Recent research has demonstrated that weak data protection, and weak database protection in particular, is one of the key security threats. Various measures can be taken to address the issue. The aim of the study to which this presentation refers is to examine whether “traditional” vulnerability registries provide a sufficiently comprehensive view of DBMS security, or whether DBMS holders should also intensively and dynamically inspect their databases by referring to Internet of Things Search Engines, moving towards a sustainable and resilient digitized environment. The study draws attention to this problem and makes you think about data security before looking for and introducing more advanced security and protection mechanisms, which, in the absence of the above, may bring no value.
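The kind of self-inspection the study argues for can start very simply. The sketch below is not the study's tooling, just a minimal stdlib-only illustration of checking whether well-known DBMS ports on a host you own accept connections – the same exposure an IoT search engine would index (port list and function name are my own, for illustration):

```python
import socket

# Default ports of some widely used DBMSs; an openly reachable port is a
# first hint that the database may also be reachable from the internet.
DBMS_PORTS = {3306: "MySQL", 5432: "PostgreSQL", 27017: "MongoDB", 6379: "Redis"}

def exposed_dbms_ports(host, timeout=0.5):
    """Return (port, dbms) pairs on `host` that accept a TCP connection."""
    open_ports = []
    for port, name in DBMS_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:   # 0 means the connect succeeded
                open_ports.append((port, name))
    return open_ports

# Scan only hosts you own or are explicitly authorized to test.
print(exposed_dbms_ports("127.0.0.1"))
```

An empty result is not proof of safety, of course – it merely shows how cheap a first proactive check is compared to cleaning up after a leak.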

Other presentations delivered during this session were “Information Security Risk Awareness Survey of non-governmental Organization in Saudi Arabia”, “Fake news and threats to IoT – the crucial aspects of cyberspace in the times of cyber war” and “Minecraft as a Tool to Enhance Engagement in Higher Education” – all incredibly interesting. All three talks were delivered by female researchers, while only the moderator of the session was male, which he found quite remarkable given the topic and ICT orientation – not a very typical case 🙂 Nevertheless, we had a great session and a very lively and fruitful discussion, mostly around GDPR-related questions, which seem to be one of the hottest areas of discussion for people representing different ICT “subbranches”. The main question we discussed was: is the GDPR more a supportive tool and a “great thing”, or rather a “headache” that sometimes even interferes with development?

In addition, shortly before the start of the event, I was asked to moderate the panel “Business in the era of pervasive digitalization“. Although, as you may know, this is not exactly in line with my area of expertise, it is in line with my interests. This is not surprising, since management, business, and economics are all very closely connected with, and dependent on, ICT. Moreover, they affect ICT, thereby pointing out the critical areas that we as IT people need to address. All in all, we had a great session with excellent talks and a lively discussion at the end, where we discussed different session-related topics, shared our experiences, thoughts, etc. Although it was a brilliant experience, one thing made it even better… A day later, a ceremony was held where the best contributions of the forum were announced, and I was named the best panel moderator in recognition of “the academic merit, quality of moderation, scheduling, and discussion held during the panel”!!!

These were three wonderful days of the forum, with very positive emotions and so many roles – panel organizer, speaker / presenter, program committee member, and panel moderator – with the cherry on the cake at the very end of the event. Thank you, Research and Innovation Forum!!! Even with us at home and participating online, you managed to give us an absolutely amazing experience and even the feeling that we were all together in Athens!