UT & Swedbank Data Science Seminar “When, Why and How? The Importance of Business Intelligence”

Last week I had the pleasure of taking part in a Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence”. In this seminar, organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank, we (me, Mohammad Gharib, Jurgen Koitsalu, and Igor Artemtsuk) discussed the importance of BI with some focus on data quality. More precisely, two of the four talks were delivered by representatives of the University of Tartu and were more theoretical in nature – we both decided to focus our talks on data quality (although for my talk this was not the main focus this time) – while the other two talks were delivered by representatives of Swedbank, mainly elaborating on BI: what it can give, what it already gives, how it is achieved, and much more. The talks were followed by a panel moderated by prof. Marlon Dumas.

In a bit more detail… in my presentation I talked about:

  • “Data warehouse vs. data lake – what are they and what is the difference between them?” – in very few words: structured vs. unstructured, static vs. dynamic (real-time) data, schema-on-write vs. schema-on-read, ETL vs. ELT. With further elaboration on: What are their goals and purposes? What is their target audience? What are their pros and cons?
  • “Is the data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay with a data warehouse. But if you want predictive BI, want to use your data for ML, or do not yet have a specific idea of how you want to use the data but want to be able to explore it effectively and efficiently, a data warehouse might not be the best option.
  • “So, the data lake will save me a lot of resources, because I do not have to worry about how to store/allocate the data – I just put it all in one storage and voila?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
  • “But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management is the answer (though it is not as easy as it sounds – do not forget about your data engineer and be friendly with him [always… literally always :D]), and also think about the culture in your organization.
  • “So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
  • “Are data warehouses and data lakes the only options to consider, or are we missing something?” – indeed, we are missing something: the data lakehouse!
  • “If a data lakehouse combines the benefits of a data warehouse and a data lake, is it a silver bullet?” – no, it is not! It is another (relatively immature) option to consider that may be the best fit for you, but it is not a panacea. Dealing with data is (still) not easy…
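To make the schema-on-write vs. schema-on-read (and ETL vs. ELT) distinction from the first bullet a bit more tangible, here is a minimal Python sketch; all the names in it (SCHEMA, write_to_warehouse, query_lake, …) are made up for illustration and do not come from any real tool:

```python
import json

# Hypothetical toy schema: every "row" should have an int id and a float amount.
SCHEMA = {"id": int, "amount": float}

def write_to_warehouse(record, table):
    """Schema-on-write (ETL-style): shape and validate BEFORE storing.
    A record that does not fit SCHEMA is rejected at load time."""
    shaped = {field: cast(record[field]) for field, cast in SCHEMA.items()}
    table.append(shaped)

def write_to_lake(record, store):
    """Schema-on-read (ELT-style): store the raw record as-is, no checks."""
    store.append(json.dumps(record))

def query_lake(store):
    """The schema is applied only when the data are read; malformed
    records surface (and here are skipped) at query time, not load time."""
    for blob in store:
        raw = json.loads(blob)
        try:
            yield {field: cast(raw[field]) for field, cast in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # a "swampy" record: silently ignored on read

warehouse, lake = [], []
good, bad = {"id": 1, "amount": "9.5"}, {"id": 2}  # `bad` lacks "amount"

write_to_warehouse(good, warehouse)   # stored as {"id": 1, "amount": 9.5}
# write_to_warehouse(bad, warehouse)  # would raise KeyError at load time
write_to_lake(good, lake)
write_to_lake(bad, lake)              # both accepted without complaint
print(list(query_lake(lake)))         # only the well-formed record survives
```

The point of the sketch is in the last comments: the warehouse pays the quality cost up front, while the lake defers it to every reader – which is exactly why a lake without governance and metadata management drifts into a swamp.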

In addition, in this talk I briefly introduced our ongoing research into integrating the data lake as a data repository with data wrangling, seeking increased data quality in information systems. In short, this is somewhat like an improved data lakehouse, where we emphasize that data governance and data wrangling need to be integrated to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse, while not a brand-new concept, is still heavily debated, including, but not limited to, its very definition).

My colleague Mohammad Gharib, in turn, discussed what data quality (DQ) and, more specifically, data quality requirements are and why they really matter, and provided a very interesting perspective on how to define high-quality data, which can further serve as the basis for defining these requirements.

All in all, although we did not know each other before and had a very limited idea of what each of us would talk about, we all agreed that the seminar turned out to be very coherent: our talks complemented each other, extending points previously touched on but not thoroughly elaborated. This not only made the seminar a success, but also led to a very lively discussion (although the prevailing part of this discussion took place during the coffee break – as it usually happens – so, unfortunately, it is not available in the recordings linked below).

The recordings are available here.

AI for Open Data or Open Data for AI? An invited talk for BBDU Development Program «Artificial Intelligence for Sustainable Development»🎤

Recently I was honored to contribute to the Babu Banarasi Das University (BBDU, Department of Computer Science and Engineering) Development Program «Artificial Intelligence for Sustainable Development» with a talk entitled “Artificial Intelligence for Open Data or Open Data for Artificial Intelligence?”. More precisely, this series of workshops is organized for industry, i.e. representatives of industry who want to gain insight into current advances in various topic-related areas (AI in the sustainability context) from people representing research and academia. It is organized by the AI Research Centre, Department of Computer Science & Engineering, Babu Banarasi Das University (India), ShodhGuru Research Labs, Soft Computing Research Society, IEEE UP Section, and the Computational Intelligence Society Chapter. My session, for instance, was attended by more than 130 attendees, which I consider a very good turnout!


Regarding my talk, which I was delighted to deliver on the last day of this event, being also a guest of honor: when we speak about “Artificial Intelligence for Open Data or Open Data for Artificial Intelligence?”, the short answer is not OR but rather AND. In other words, AI for Open Data and Open Data for AI, where open data serves as a valuable asset for AI (of course, if a list of prerequisites is fulfilled), while AI defines new prerequisites for open data that we should think of.

At the same time, although their combination is considered to play a transformational role in human society, and especially in prominent areas, as we discussed that day, this “magic duo” is not always about “unicorns and ice creams”: the current state-of-the-art suggests that open data may also pose certain risks.

Probably the most striking example of this that I referred to is one where, based on easily obtainable open data on toxic molecules collected over the years, AI managed to generate 40,000 candidate molecules potentially usable as biochemical weapons in just 6 hours. And while not all of them are actually usable, and the need to synthesize them still remains, some correspond to known chemical weapons, with one even more toxic than the VX nerve agent, identified as a weapon of mass destruction by the United Nations.

So here comes a very interesting dilemma between openness as a philosophy and making data open, on the one hand, and the threats these data may pose if used by a malevolent actor, on the other.

We also briefly touched on the risks associated with AI (considering the perspectives of both so-called cyber-pessimists and cyber-optimists in this regard), open data, and their combination, along with the long list of benefits they can bring, including their contribution to sustainability, in line with the general idea of this event.
And, of course, we could not ignore the topic of green AI and a strong need to consider FATE principles (Fairness, Accountability, Transparency & Explainability).

All in all, it was a very nice experience, and the audience was so curious and passionate about the topics elaborated on within this 6-day event, with speakers from four continents – Asia, Africa, America, and Europe (represented by me! 🤓🤓🤓). An exceptional audience with highly relevant questions, leading to a lively and fruitful discussion of interest to both participants and speakers. Glad to be part of it and get this experience!

This is just in a few words, although at some point I plan to extend this post with more details and thoughts.

Several summer activities – African Smart Cities Lab, One Conference 2022, EFSA & EBTC joint project, Guest Lecture, “Virtual Brown Bag Lunch”

Considering that in recent weeks I was pretty active in delivering very many talks, let me use this post to summarize some of them, thereby retaining them in my memory as well as allowing you, my dear reader, to pick up some ideas or navigate to projects, initiatives, postgraduate programs, joint workshops or “lunches” for business and academia of your interest. So this post is less about self-advertisement and my role in the events discussed below as panelist, keynote speaker, guest lecturer, invited speaker and expert, and more about very interesting projects, initiatives and labs currently running in different countries and at different scales – local, national, regional and international. And as a “thank you” to the organizers of each of them, I would like to shed light on them in this post, drawing your attention to them!

All in all, this post is about participating as a panelist at the ONE Conference 2022, keynote speaker at the African Smart Cities Lab project workshop (Morocco, Ghana, Tunisia, South Africa, Rwanda, Benin, Switzerland), guest lecturer for master and doctoral students of the Federal University of Technology – Paraná (UTFPR, Postgraduate Program in Production Engineering, Brazil), invited speaker / expert for the monthly “Virtual Brown Bag Lunch” (Mexico), and contributor to the EFSA & EBTC joint project (Italy) on the creation of a standard for data exchange in support of automation of Systematic Review.

So, let’s start with the most spontaneous one, namely the “Integration of open data and artificial intelligence in the development of smart cities in Africa” workshop organized as part of the African Cities Lab Project, where I was invited as a keynote speaker. The African Smart Cities Lab project is a very interesting initiative I was recently glad to become familiar with. It is a joint initiative led by École polytechnique fédérale de Lausanne (Switzerland), the Kwame Nkrumah University of Science and Technology, Kumasi (Ghana), the UM6P – Mohammed VI Polytechnic University (Morocco), Sèmè City campus (Benin), the Faculty of Sciences of Bizerte – University of Carthage (Tunisia), the University of Cape Town (South Africa), and the University of Rwanda, and it aims to create a digital education platform on urban development in Africa, offering quality MOOCs and online continuing-education training for professionals. It is also expected to act as a forum for the exchange of digital educational resources and for the management and governance of African cities to foster sustainable urban development. The very first workshop took place on July 5 in online mode, where 9 speakers were invited to share their experience on this topic and help set the scene for the development of African smart cities, considering their potential, but also some bottlenecks.

All in all, two very fruitful sessions with presentations delivered by me, Vitor Pessoa Colombo, Constant Cap, Oualid Ali, Jérôme Chenal, Nesrine Chehata, AKDIM Tariq, Christelle Gracia Gbado, and Willy Franck Sob took place, raising a lot of questions and finding answers to many of them. My talk was titled “Open data and crowdsourced data as enablers and drivers for smart African cities” (see slides below…)

Here, let me immediately mention another activity – a guest lecture, “The role of open data in the development of sustainable smart cities and smart society”, that I delivered to students of the Federal University of Technology – Paraná (UTFPR, Brazil), more precisely within the so-called PPGEP program – the Postgraduate Program in Production Engineering (port. Programa de Pós-Graduação em Engenharia de Produção) – in the scope of which I was pleased to raise a discussion on three topics of particular interest – open data, smart city, and Society 5.0 – which are actually very interrelated. This also allowed me to refer to one of our recent studies – “Transparency of open data ecosystems in smart cities: definition and assessment of the maturity of transparency in 22 smart cities” – published together with my colleagues Martin Lnenicka, Mariusz Luterek, Otmane Azeroual, Dandison Ukpabi, Visvaldis Valtenbergs, and Renata Machova in Sustainable Cities and Society (Q1, Impact Factor: 7.587, SNIP: 2.347, CiteScore: 10.7).

And now it’s time to turn to two events organized by the European Food Safety Authority (EFSA). The first, and probably the most “crowded” due to very high attendance, was the ONE Conference 2022 (Health, Environment, Society), which took place between June 21 and 24 in Brussels, Belgium. It was co-organised by EFSA and its European sister agencies – the European Environment Agency, European Medicines Agency, European Chemicals Agency, and European Centre for Disease Prevention and Control (ECDC) – but if you are an active follower of my blog, you know this already, just as you probably remember that I posted about this event previously, inviting you to join us in Belgium or online. Since I have already elaborated on the course of the event, its main objectives and tracks, I will not repeat this information. Instead, let me briefly summarize the key takeaways, with a particular focus on the panel for which I served as a panelist – the “ONE society” thematic track, panel discussion “Turning open science into practice: causality as a showcase”. It was a very nice experience and an opportunity to share our experience on the obstacles, benefits and feasibility of adopting open science approaches, and to elaborate on the following questions (there were more, but these are my favorites):
💡Can the use of open science increase trust in regulatory science? Or does it increase the risk of losing focus, introducing conflicting interests and, thus, threatening reputation? What are the barriers to making open science viable in support of the scientific assessment process carried out by public organizations?
💡What tools/methods are available today for enabling, supporting and sustaining long-term open science initiatives, and what could be envisaged for the future?
💡Do we need governance to handle open data in support of scientific assessment processes carried out by regulatory science bodies?
💡How can data coming from different sources be harmonized, making them appropriate for further use and combination?

These and many more questions were discussed by panelists with different backgrounds and expertise, which EFSA nicely presented by breaking down our experience into four categories – social science (Leonie Dendler, German Federal Institute for Risk Assessment BfR), open data expert (Anastasija Nikiforova, EOSC Association, University of Tartu, Institute of Computer Science), lawyer (Thomas Margoni, KU Leuven), and regulatory science (Sven Schade, Joint Research Centre, EU Science, Research and Innovation). Many thanks to Laura Martino, Federica Barrucci, Claudia Cascio, Laura Ciccolallo, Marios Georgiadis, Giovanni Iacono, Yannick Spill (EFSA), and of course to Tony Smith and Jean-François Dechamp (European Commission). For more information, refer to this page.

And as a follow-up to this event, I was kindly invited by EFSA to contribute to setting the scene on the concepts of ‘standards for data exchange’, ‘standards for data content’ and ‘standards for data generation’ as part of the ongoing European Food Safety Authority (EFSA) and Evidence-Based Toxicology Collaboration (EBTC) project on the creation of a standard for data exchange in support of automation of Systematic Review (as an answer to the call made in the “Roadmap for actions on artificial intelligence for evidence management in risk assessment”). It was really nice to learn that what we are doing in the EOSC Association (Task Force “FAIR metrics and data quality”) is of interest to our colleagues from EFSA and EBTC.
Also, it was super nice to listen to other points of view and get involved in the discussion with the other speakers and organisers – Elisa Aiassa, Angelo Cafaro, Fulvio Barizzone, Ermanno Cavalli, Marios Georgiadis, Irene Pilar, Irene Muñoz Guajardo, Federica Barrucci, Daniela Tomcikova, Carsten Behring, Irene Da Costa, Raquel Costa, Maeve Cushen, Laura Martino, Yannick Spill, Davide Arcella, Valeria Ercolano, Vittoria Flamini, Kim Wever, Gunn Vist, Annette Bitsch, Daniele Wikoff, Carlijn Hooijmans, Sebastian Hoffmann, Seneca Fitch, Paul Whaley, Katya Tsaioun, Alexandra Bannach-Brown, Ashley Elizabeth Muller, Anne Thessen, Julie McMurray, Brian Alper, Khalid Shahin, Bryn Rhodes, and Kaitlyn Hair. The next workshop is expected to take place in September, with the first draft ready by the end of this year and presented during one of the upcoming events. More info on this will follow 🙂

In addition, I was asked by my Mexican colleagues to deliver an invited talk for the monthly “Virtual Brown Bag Lunch Talks” intended for information technologies, manufacturing, and engineering employees in companies associated with the Index Manufacturing Association (Mexico, web-based). After discussing several topics with the organizers of this event, we decided that this time the most relevant talk for the audience would be “Data Security as a top priority, or what Internet of Things (IoT) search engines know about you”. Again, if you are an active follower, you will probably quickly realize that it is based on a list of my previous studies – study#1, study#2, study#3 – and a book chapter.

All in all, these were just a few of the activities I was busy with during the last weeks – weeks that were indeed very busy but extreeeemely interesting, with so many different events! I am grateful to all those who invited me to take part in them, and I believe that this was just one of the opportunities we have had to collaborate, with many more to come in the future!

ICEGOV2022 workshop: Identification of high-value dataset determinants: is there a silver bullet?

This year, the 15th International Conference on Theory and Practice of Electronic Governance, known as ICEGOV2022, will focus on “Digital Governance for Social, Economic, and Environmental Prosperity”. And we – me, Charalampos Alexopoulos, Nina Rizun and Magdalena Ciesielska – are glad to announce our community-based, participatory, interactive workshop aimed at identifying High-Value Dataset (HVD) determinants towards efficient sustainability-oriented data-driven development.

Briefly about the workshop, our motivation, our objective and why we want to make you a part of it…

Today, Open Government Data (OGD) are seen as one of the trends that can potentially benefit the economy, improve the quality, efficiency, and transparency of public services, and transform our lives, contributing to efficient sustainability-oriented data-driven development. Neither their scope nor the actors who can work with them are subject to any restrictions. In addition to “classical” benefits such as improving the quality, efficiency, and transparency of public services, they are considered drivers and promoters of Industry 4.0 and Society 5.0 [1,2], including smart city trends. OGD are also a driver of economic growth: according to [3], the open data market size in 2020 was estimated at €184 billion, and it is expected to grow in the coming years, reaching between €199.51 and €334.21 billion in 2025. However, the achievement of these benefits is closely linked to the “value” of the data, i.e. the extent to which the data provided by public agencies are interesting, useful and valuable for reuse, creating value for society and the economy. High data availability, however, can disorient users when deciding which sources are best suited to their needs [4]. Practice demonstrates that the majority of datasets available on OGD portals are not used, and only a few datasets create value for users [5], [6]. This is also in line with Quarati and Martino [4], who provided a snapshot of the use of 15 OGD portals based on available usage indicators. This also applies to Latvia [7,8]. In other words, in order to gain benefit from OGD, countries should open data cleverly, where not quantity but quality and data value must be more important, since all the benefits of OGD can only be obtained if the data are re-used and transformed into value.

Here the concept of “high-value datasets” comes into play, pointing to data that would create the highest value for society and the economy. High-value data are defined as data “the re-use of which is associated with important benefits for society, the environment and the economy, in particular because of their suitability for the creation of value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets” [9]. Although the PSI Directive is a step in this direction, announcing six categories [9], these appear to be generic and do not take into account the national perspective, i.e. the nature of these datasets will depend to a large extent on the country concerned [10,11].
It is therefore important to support the identification of high-value datasets, which would enhance the interest of OGD users in transforming data into innovative solutions and services. Research suggests that different perspectives on identifying “high-value datasets” appear in the literature and there is no consensus on the most comprehensive one, so a number of activities covering these perspectives – identified beforehand within the workshop – will be carried out.

This workshop expects to raise a discussion on the identification of high-value datasets towards a common understanding of how this could be done in general terms, i.e. what possible activities will lead to a better understanding and clearer vision of which datasets are the most valuable for the society and economy of a particular country and how they can be identified (how? by whom? etc.). The topic under consideration is very important these days, given that opening up datasets with high potential for use and re-use is expected to facilitate the creation of new products and services with positive economic and social impact [12]. However, identifying these data is a complicated task, particularly where country-specific datasets should be identified.

This workshop is a step in this direction and a continuation of the paper presented at ICEGOV2021 [13], where a first step was taken by conducting a survey of individual users and SMEs in Latvia aimed at clarifying their level of awareness of the existence of OGD, their usage habits, and their overall level of satisfaction with the value of OGD and their potential. This time we aim to develop a framework for the identification of high-value datasets (and their determinants) as a result of a comprehensive study conducted jointly with ICEGOV participants. All in all, the objective of the workshop is to raise awareness of, and establish a network of, the major stakeholders around the HVD issue, allowing each participant to think about how and whether the determination of HVD is taking place in their country and how it can be improved with the help of portal owners, data publishers, data owners and citizens. Our main motivation is that, as members of the ICEGOV community, we could jointly answer the following questions representing the objectives of the workshop:

  1. How can the “value” of open data be defined?
  2. What are the current indicators for determining the value of data? Can they be used to identify valuable datasets to be opened? What country-specific high-value determinants (aspects) can participants think of?
  3. How can high-value datasets be identified? What mechanisms and/or methods should be put in place to allow their determination? Could there be an automated way to gather information for HVD? Can they be identified by third parties, e.g. researchers and enthusiasts, AND potential data publishers, i.e. data owners?
  4. What should be the scope of the framework, i.e. who is the target audience that should be made aware of HVD when applying this framework? Public officials/servants? Data owners? Intermediaries? (a discussion with participants OR a direction for our discussion, depending on the participants and their profile).

More precisely, the following “procedure” is expected to be followed:

  • STEP 0 (carried out by participants beforehand (not mandatory)): participants are invited to become familiar with the open data portals of their country (higher coverage, i.e. more than their own country, is welcome) by inspecting the current state-of-the-art in terms of both the content – the data available – and the functionality, with particular interest in HVD determination-related features (if any), including citizen-engagement-oriented features, features allowing the current interest of users to be tracked, etc.
  • STEP 1: A brief introduction to the current state-of-the-art [approximately 45 minutes]: How HVD are seen by the PSI Directive and what tasks are set for countries regarding the determination and opening of HVD; how countries are coping with this (both from the grey literature and from personal experience in Latvia); what approaches and methods for determining HVD are known, and why is there no uniform method/framework? A brief overview of the results of a survey of individual users and small and medium-sized enterprises (SMEs) in Latvia on their view of the current state of the data, i.e. to what extent the data meet their needs, what data might be useful for them, and how their availability would affect their willingness to use these data. An overview of the Deloitte report on HVD: What is the methodology used? What are the indicators used? What are the results of the study?
  • STEP 2: Considering the diversity of perceptions of the term “value” (depending on the domain, actor, etc.), a discussion in the form of brainstorming (idea generation) is expected to be held, providing as many definitions as possible, which are then used to produce a more comprehensive definition(s) considering different perspectives (domain- and actor-related) [approximately 30-45 minutes]
  • STEP 3: Discussion on current methods/mechanisms for determining the current value of data and for determining HVD, in the form of brainstorming [approximately 20-30 minutes]
  • STEP 4: Idea generation on potential methods/mechanisms for determining the value of data and for determining HVD, in the form of brainstorming [approximately 20-30 minutes]
  • STEP 5: Iterative filtering of the features, methods and approaches that could constitute the framework for the determination of high-value datasets, in the form of a Delphi-like analysis [approximately 45 minutes]
  • STEP 6: Agenda for future research, networking [approximately 30 minutes]

This is a community-based, participatory, interactive workshop aimed at engaging participants – instead of asking participants to write a paper to be presented later during the workshop in a sit-and-listen format, we expect to establish a lively and interesting discussion of novel ideas, answering existing questions and raising new ones. The audience of the workshop is ICEGOV participants, without restriction on the domain they represent, affiliation, interests, knowledge or experience. Both OGD experts and those who are not familiar with OGD are welcome.

Join us this October (4 – 7 October 2022)!

References:

  1. Bargiotti, L., De Keyzer, M., Goedertier, S., & Loutas, N. (2014). Value based prioritisation of Open Government Data investments. European Public Sector Information Platform.
  2. Bertot, J. C., McDermott, P., & Smith, T. (2012, January). Measurement of open government: Metrics and process. In 2012 45th Hawaii International Conference on System Sciences (pp. 2491-2499). IEEE.
  3. Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information
  4. European Commission, The Digital Economy and Society Index (DESI), online, https://ec.europa.eu/digital-single-market/en/digital-economy-and-society-index-desi, last accessed: 7.04.2021
  5. Gagliardi, D., Schina, L., Sarcinella, M. L., Mangialardi, G., Niglia, F., & Corallo, A. (2017). Information and communication technologies and public participation: interactive maps and value added for citizens. Government Information Quarterly, 34(1), 153-166.
  6. Huyer, E., Blank, M. (2020). Analytical Report 15: High-value datasets: understanding the perspective of data providers. Luxembourg: Publications Office of the European Union, 2020 doi:10.2830/363773
  7. Kampars, J., Zdravkovic, J., Stirna, J., & Grabis, J. (2020). Extending organizational capabilities with Open Data to support sustainable and dynamic business ecosystems. Software and Systems Modeling, 19(2), 371-398.
  8. Kotsev, A., Cetl, V., Dusart, J., & Mavridis, D. (2018). Data-driven Economies in Central and Eastern Europe
  9. Kucera, J., Chlapek, D., Klímek, J., & Necaský, M. (2015). Methodologies and Best Practices for Open Data Publication. In DATESO (pp. 52-64).
  10. McBride, K., Toots, M., Kalvet, T., & Krimmer, R. (2019). Turning Open Government Data into Public Value: Testing the COPS Framework for the Co-creation of OGD-Driven Public Services. In Governance Models for Creating Public Value in Open Data Initiatives (pp. 3-31). Springer, Cham.
  11. Nikiforova, A., & Lnenicka, M. (2021). A multi-perspective knowledge-driven approach for analysis of the demand side of the Open Government Data portal. Government Information Quarterly, 101622
  12. Ruijer, E., Détienne, F., Baker, M., Groff, J., & Meijer, A. J. (2020). The politics of open government data: Understanding organizational responses to pressure for more transparency. The American review of public administration, 50(3), 260-274
  13. Nikiforova, A. (2021, October). Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia. In 14th International Conference on Theory and Practice of Electronic Governance (pp. 367-372).