Online International Training and Capacity Building Program-2024 (ITCBP-2024) on “Data Management for AI Cities” at the School of Planning and Architecture, New Delhi, and my talk on “Data Visualisation for Cities: City Based Applications”

Yesterday, I had the honor of serving as an expert speaker at the Online International Training and Capacity Building Program-2024 (ITCBP-2024) on “Data Management for AI Cities”, organised by the School of Planning and Architecture, New Delhi (SPA FIRST), which invited me to deliver a talk on “Data Visualisation for Cities: City Based Applications”.

During this talk, we touched on several important aspects surrounding data management and visualization in and for cities, including:

  • Data management, which we then narrowed down to data quality management (DQM) of both internal and external data: departing from understanding these data to managing their quality throughout the DQM lifecycle (stressing that data cleaning is not the same as DQM). We touched on several approaches to this, with greater emphasis on AI-augmented data quality management – existing tools, their underlying methods, and the weaknesses to be considered when using (semi-)automatic data quality rule recognition, depending on the method used for this purpose (see the first sketch after this list);
  • Data governance was then discussed, stressing how it differs from DQM, what it consists of, and why it is crucial, incl. within the context of this talk;
  • Data visualization & storytelling – their role, key principles, common mistakes, and best practices. As part of this, we covered strategies for selecting the data visualization type, with tips on how to simplify this process, incl. by referring to chart selectors, but also stressing why “thinking outside the menu” is critical, esp. for city-level data visualization (where your audience is often citizens or policymakers). We looked at the most common and/or successful uses of non-traditional visualization types, incl. the tools to be used for these purposes, breaking them into those that require coding and those that are rather low- or no-code; noise reduction – simplicity – the strategic use of accents; and the use of drill-down (aka roll-down) & roll-up to convey the message you want to deliver without highlighting everything and thereby losing your audience. In addition, a UX perspective was discussed, including but not limited to aspects that are often overlooked when thinking about design and an aesthetic color palette, namely the potential color-blindness of the audience that might “consume” these visualizations – and, again, tips on how to handle this more easily (see the second sketch after this list). Did you know that there are 300 million color-blind people, and that 98% of those with color blindness have red-green color blindness?
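
To make the rule-recognition point above a bit more tangible, here is a minimal sketch (in Python/pandas, with a purely hypothetical city dataset – all column names and thresholds are my own illustrative assumptions) of what explicit data quality rules look like; these are exactly the kinds of rules that AI-augmented DQM tools attempt to derive (semi-)automatically:

```python
import pandas as pd

# Hypothetical city air-quality readings; all names and thresholds
# are illustrative.
df = pd.DataFrame({
    "sensor_id": ["s1", "s2", "s2", "s4"],
    "pm25":      [12.3, None, 480.0, 8.1],
    "district":  ["north", "south", None, "east"],
})

# Explicit, declarative data quality rules – the kind of rules that
# AI-augmented DQM tools try to recognise (semi-)automatically.
rules = {
    "pm25 is present":     df["pm25"].notna(),
    "pm25 within 0..500":  df["pm25"].between(0, 500),
    "district is present": df["district"].notna(),
    "sensor_id is unique": ~df["sensor_id"].duplicated(keep=False),
}

for name, passed in rules.items():
    violations = int((~passed).sum())
    status = "OK" if violations == 0 else f"{violations} violation(s)"
    print(f"{name}: {status}")
```

In practice, a dedicated DQM tool would manage such rules across the whole lifecycle rather than in an ad-hoc script – which is precisely why the method behind the automatic rule recognition matters.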
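And to illustrate the color-blindness point: one widely recommended color-blind-safe choice is the Okabe-Ito palette. A minimal matplotlib sketch, with invented district data, might look like this:

```python
import matplotlib.pyplot as plt

# Okabe-Ito palette: a well-known color-blind-safe color set that
# stays distinguishable under red-green color blindness.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

districts = ["north", "south", "east", "west"]  # hypothetical data
incidents = [42, 55, 30, 61]

fig, ax = plt.subplots()
ax.bar(districts, incidents, color=OKABE_ITO[:len(districts)])
ax.set_ylabel("Reported incidents (illustrative)")
ax.set_title("District comparison with a color-blind-safe palette")
plt.show()
```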

So what was the key message or “takeaway” of this talk? In a very few words:

  • Understand your data, your audience, and the story you want to tell! Understand:
    • your data,
    • the story it tells,
    • your target audience’s preferences and needs,
    • the story you want to tell,
    • data suitability,
    • data quality.
  • Attention-grabbing visuals & storytelling are key!
    • reduce noise to avoid audience confusion and distraction
    • use drill-down and roll-up operations to keep the visualization simple (see the sketch after this list)
    • add context to provide all the information necessary for clear understanding
    • add highlights to focus your audience’s attention – add accents strategically
  • Consider design – the optimal visualisation type, chart design, environment design, and the potential color-blindness of your audience
  • Keep track of the current advances, but also the challenges and risks, of data visualization in urban settings, incl. but not limited to (1) privacy concerns, (2) data silos, and (3) technological limitations.
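
As a small illustration of the drill-down and roll-up point above, here is a minimal pandas sketch (the trip counts and dimensions are invented for the example): aggregate to keep the overview simple, and reveal the finer granularity only where your audience needs the detail:

```python
import pandas as pd

# Hypothetical trip counts at the finest granularity:
# one row per district per month.
df = pd.DataFrame({
    "district": ["north", "north", "south", "south"],
    "month":    ["2024-01", "2024-02", "2024-01", "2024-02"],
    "trips":    [1200, 1350, 900, 880],
})

# Roll-up: aggregate the month dimension away to keep the overview simple.
overview = df.groupby("district", as_index=False)["trips"].sum()
print(overview)

# Drill-down: reveal the finer monthly detail only for the part the
# audience asks about, instead of highlighting everything at once.
north_detail = df[df["district"] == "north"]
print(north_detail)
```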

All in all, it was quite a rich conversation, and I am very grateful to the organizers for the invitation to be part of this event and to the audience for the very positive feedback!

UT & Swedbank Data Science Seminar “When, Why and How? The Importance of Business Intelligence”

Last week I had the pleasure of taking part in a Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence”. In this seminar, organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank, we (me, Mohamad Gharib, Jurgen Koitsalu, and Igor Artemtsuk) discussed the importance of BI with some focus on data quality. More precisely, two of the four talks were delivered by representatives of the University of Tartu and were more theoretical in nature, as we both decided to focus our talks on data quality (although for my talk this was not the main focus this time), while the other two talks were delivered by representatives of Swedbank, who mainly elaborated on BI – what it can give, what it already gives, how it is achieved, and much more. These talks were followed by a panel moderated by Prof. Marlon Dumas.

In a bit more detail… In my presentation, I talked about:

  • “Data warehouse vs. data lake – what are they and what is the difference between them?” – in a very few words: structured vs. unstructured, static vs. dynamic (real-time) data, schema-on-write vs. schema-on-read, ETL vs. ELT; with further elaboration on their goals and purposes, their target audiences, and their pros and cons (see the first sketch after this list);
  • “Is the data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay with a data warehouse; but if you want to have predictive BI or use your data for ML (or do not yet have a specific idea of how you want to use the data, but want to be able to explore it effectively and efficiently), then a data warehouse might not be the best option.
  • “So, the data lake will save me a lot of resources, because I do not have to worry about how to store/allocate the data – I just put it all in one storage and voilà?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
  • “But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management are the answer (but this is not as easy as it sounds – do not forget about your data engineer and be friendly with him [always… literally always :D]), and also think about the culture in your organization (see the second sketch after this list);
  • “So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
  • “Are data warehouses and data lakes the only options to consider, or are we missing something?” – we are indeed missing one: the data lakehouse!
  • “If a data lakehouse combines the benefits of a data warehouse and a data lake, is it a silver bullet?” – no, it is not! It is another (relatively immature) option to consider that may be the best fit for you, but it is not a panacea. Dealing with data is (still) not easy…
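
To make the schema-on-write vs. schema-on-read distinction from the first point above more concrete, here is a minimal, purely illustrative Python sketch (the table and field names are invented): a warehouse-style store validates structure when data is loaded, while a lake-style store defers any structure to read time:

```python
import json
import sqlite3

# Schema-on-write (warehouse-style): structure is enforced at load
# time, so a non-conforming record is rejected up front.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (trip_id INTEGER NOT NULL, "
            "distance_km REAL NOT NULL)")
con.execute("INSERT INTO trips VALUES (?, ?)", (1, 4.2))  # accepted
# con.execute("INSERT INTO trips VALUES (?, ?)", (2, None))
# ^ would raise sqlite3.IntegrityError: NOT NULL constraint failed

# Schema-on-read (lake-style): raw records are stored as-is, and a
# structure is imposed only when someone reads them.
raw_records = ['{"trip_id": 1, "distance_km": 4.2}',
               '{"trip_id": 2}']  # incomplete record is still stored
for line in raw_records:
    rec = json.loads(line)
    # Missing fields only surface at read time, here as None.
    print(rec["trip_id"], rec.get("distance_km"))
```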
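And on keeping the lake from becoming a swamp: the sketch below shows, again purely as an illustration of the idea rather than any prescribed standard, the kind of minimal catalog metadata one might record for every dataset landing in the lake:

```python
from datetime import date

# A minimal, illustrative catalog entry recorded whenever a dataset
# lands in the lake. Field names and values are assumptions, not a
# standard; without at least this much metadata, nobody can tell
# what a file is, where it came from, or who owns it – the swamp.
catalog_entry = {
    "dataset":         "raw/air_quality/2024/",
    "owner":           "environment-dept",       # accountable steward
    "source":          "city sensor network",    # lineage
    "ingested_on":     date.today().isoformat(),
    "schema_hint":     {"sensor_id": "str", "pm25": "float"},
    "quality_checked": False,                    # DQM still has to happen!
}
print(catalog_entry)
```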

In addition, in this talk I also briefly introduced our ongoing research on integrating the data lake as a data repository with data wrangling, seeking increased data quality in information systems (IS). In short, this is somewhat like an improved data lakehouse, where we emphasize that data governance and data wrangling need to be integrated in order to really obtain the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse, while not a brand-new concept, is still heavily debated, including, but not limited to, its very definition).

My colleague Mohamad Gharib, in turn, discussed what data quality (DQ) and, more specifically, data quality requirements are, why they really matter, and provided a very interesting perspective on how to define high-quality data, which could further serve as the basis for defining these requirements.

All in all, although we did not know each other before and had a very limited idea of what each of us would talk about, we all agreed that this seminar turned out to be very coherent, with our talks complementing each other and extending points that had been touched on but not thoroughly elaborated before. This not only made the seminar a success, but also sparked a very lively discussion (although the prevailing part of this discussion took place during the coffee break – as usually happens – and so, unfortunately, it is not available in the recordings, the link to which is available below).