📢 New paper alert 📢 “Predictive Analytics intelligent decision-making framework and testing it through sentiment analysis on Twitter data”, or: what do people think – and what will they think – about ChatGPT?

This paper alert is dedicated to the paper “Predictive Analytics intelligent decision-making framework and testing it through sentiment analysis on Twitter data” (authors: Otmane Azeroual, Radka Nacheva, Anastasija Nikiforova, Uta Störl, Amel Fraisse), which is now publicly available in the ACM Digital Library!

In this paper we present a predictive analytics-driven decision framework based on machine learning and data mining methods and techniques. We then demonstrate it in action by predicting sentiments and emotions in social media posts as a use case, choosing perhaps the trendiest topic of all – ChatGPT. In other words, we check whether it is eternal love and complete trust, or rather 🤬?
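To give a flavour of what such a sentiment-scoring step can look like, here is a minimal sketch using NLTK’s off-the-shelf VADER analyzer – an illustration only, not the actual pipeline from the paper, and the example tweets are made up:

```python
# Minimal sentiment-scoring sketch (illustrative only, not the paper's pipeline).
# Assumes: pip install nltk, and tweets already collected as plain strings.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download

tweets = [
    "ChatGPT wrote my unit tests in seconds, I love it!",
    "ChatGPT confidently gave me a wrong answer again...",
]

sia = SentimentIntensityAnalyzer()
for tweet in tweets:
    scores = sia.polarity_scores(tweet)  # neg/neu/pos plus a compound score in [-1, 1]
    if scores["compound"] > 0.05:
        label = "positive"
    elif scores["compound"] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:8} {scores['compound']:+.2f}  {tweet}")
```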

Why PA?

Predictive Analytics has proven useful across a wide range of domains:

  • Business and the medical/healthcare domain, including but not limited to crisis management, where, in addition to health-related crises, Predictive Analytics has proven useful in natural disaster management.
  • Industrial use cases, such as energy – to forecast supply and demand and to predict the impact of equipment costs, downtimes and outages – and aerospace – to predict the impact of specific maintenance operations on aircraft reliability, fuel use, and uptime – while the biggest airlines use it to predict travel patterns, set ticket prices and flight schedules, and predict the impact of, e.g., price changes, policy changes, and cancellations.
  • Business process management and specifically retail, where Predictive Analytics allows retailers to follow customers in real time, delivering targeted marketing and incentives, to forecast inventory requirements, and to configure their website (or store) to increase sales. In the business process management area, in turn, Predictive Analytics gives rise to what is called predictive process monitoring (PPM).
  • Smart Cities and Smart Transportation, e.g., to support smart transportation services using open data.
  • Education, e.g., to predict performance in MOOCs.

This popularity can be easily explained by examining their key strategic objectives, which IBM (Siegel, 2015) has summarized as: (1) competition – to secure the most powerful and unique stronghold of competitiveness, (2) growth – to increase sales and retain customers competitively, (3) enforcement – to maintain business integrity by managing fraud, (4) improvement – to advance core business capacity competitively, (5) satisfaction – to meet rising consumer expectations, (6) learning – to employ today’s most advanced analytics, (7) acting – to render business intelligence and analytics truly actionable. Marketing, sales, fraud detection, call centers and the core businesses of business units, as well as customers and the enterprise as a whole, are all expected to benefit, which makes PA a “must”.

According to (MicroStrategy, 2020), in 2020 52% of companies worldwide used predictive analytics to optimize operations as part of a business intelligence platform solution. So far, however, predictive analytics has been used mostly by large companies (65% of companies with $100 million to $500 million in revenue, compared to 46% of companies under $10 million in revenue), with less adoption among medium-sized companies, let alone small ones.

Based on management theory and Gartner’s Business Intelligence and Performance Management Maturity Model, our framework covers four management levels of business intelligence – (a) Operational, (b) Tactical, (c) Strategic and (d) Pervasive. These are the levels that determine the need to manage data in organizations, transform it into information and turn that information into knowledge, which in turn is the basis for making forecasts. The end result of applying the framework for business purposes is to generate effective solutions for each of these levels.
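As a toy illustration of these levels in code – the level names come from the framework, but the example decision scopes attached to them are my own illustrative assumptions, not definitions from the paper:

```python
# Toy representation of the four BI management levels covered by the framework.
# Level names are from the paper; the example decision scopes are illustrative
# assumptions for this sketch, not the paper's definitions.
from enum import Enum

class BILevel(Enum):
    OPERATIONAL = "day-to-day process decisions"
    TACTICAL = "short-term resource allocation"
    STRATEGIC = "long-term direction and forecasting"
    PERVASIVE = "analytics embedded across the whole organization"

for level in BILevel:
    print(f"{level.name:<12} -> {level.value}")
```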

Sounds catchy? Read the paper here.

Many thanks to my co-authors – Radka and Otmane, who invited me to contribute to this study, and drove the entire process!

Cite paper as:

O. Azeroual, R. Nacheva, A. Nikiforova, U. Störl, and A. Fraisse. 2023. Predictive Analytics intelligent decision-making framework and testing it through sentiment analysis on Twitter data. In Proceedings of the 24th International Conference on Computer Systems and Technologies (CompSysTech ’23). Association for Computing Machinery, New York, NY, USA, 42–53. https://doi.org/10.1145/3606305.3606309

UT & Swedbank Data Science Seminar “When, Why and How? The Importance of Business Intelligence”

Last week I had the pleasure of taking part in a Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence”. In this seminar, organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank, we (me, Mohamad Gharib, Jurgen Koitsalu, and Igor Artemtsuk) discussed the importance of BI with some focus on data quality. More precisely, two of the four talks were delivered by representatives of the University of Tartu and were more theoretical in nature – we both decided to focus on data quality (although for my talk this was not the main focus this time) – while the other two talks were delivered by representatives of Swedbank, mainly elaborating on BI: what it can give, what it already gives, how it is achieved, and much more. These talks were followed by a panel moderated by prof. Marlon Dumas.

In a bit more detail… In my presentation, I talked about:

  • “Data warehouse vs. data lake – what are they and what is the difference between them?” – in very few words: structured vs unstructured, static vs dynamic (real-time data), schema-on-write vs schema-on-read, ETL vs ELT (see the sketch after this list), with further elaboration on: What are their goals and purposes? What is their target audience? What are their pros and cons?
  • “Is the data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay with a data warehouse. But if you want to either have predictive BI or use your data for ML (or do not have a specific idea of how you want to use the data, but want to be able to explore it effectively and efficiently), a data warehouse might not be the best option.
  • “So, the data lake will save me a lot of resources, because I do not have to worry about how to store/allocate the data – just put it all in one storage and voilà?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
  • “But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management is the answer (but it is not as easy as it sounds – do not forget about your data engineer and be friendly with them [always… literally always :D]), and also think about the culture in your organization.
  • “So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
  • “Are data warehouses and data lakes the only options to consider, or are we missing something?” – true, we are missing something: the data lakehouse!
  • “If a data lakehouse is a combination of the benefits of a data warehouse and a data lake, is it a silver bullet?” – no, it is not! It is another option (relatively immature) to consider that may be the best fit for you, but not a panacea. Dealing with data is (still) not easy…
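To make the schema-on-write vs schema-on-read distinction from the first point a bit more concrete, here is a minimal pandas sketch – file names, columns, and values are made up for illustration; a real warehouse or lake would of course sit on a database and object storage rather than local files:

```python
# Schema-on-write vs schema-on-read, in miniature (illustrative names/values).
import pandas as pd

raw = pd.DataFrame({"user_id": ["1", "2"], "amount": ["10.5", "oops"]})

# Schema-on-write (warehouse style): validate and shape the data BEFORE storing it.
clean = raw.copy()
clean["user_id"] = clean["user_id"].astype(int)
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # bad values surface now
clean.dropna(subset=["amount"]).to_parquet("warehouse_table.parquet")  # needs pyarrow

# Schema-on-read (lake style): store the raw data as-is, interpret it at query time.
raw.to_json("lake_dump.json", orient="records", lines=True)
later = pd.read_json("lake_dump.json", lines=True)  # schema is decided only when reading
later["amount"] = pd.to_numeric(later["amount"], errors="coerce")
print(later)
```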

In addition, in this talk I also briefly introduced our ongoing research into integrating the data lake as a data repository with data wrangling, seeking increased data quality in information systems. In short, this is somewhat like an improved data lakehouse, where we emphasize the need for data governance and data wrangling to be integrated in order to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse, while not a brand-new concept, is still heavily debated, including, but not limited to, its very definition).
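As a rough idea of the kind of data-quality gate such an integrated wrangling step could apply before a dataset is promoted within the lake – the specific checks, thresholds, and data below are illustrative assumptions for this sketch, not the rules from our study:

```python
# Illustrative data-quality gate for data entering or being promoted in a data lake.
# Checks, thresholds, and data are assumptions for the sketch, not our study's rules.
import pandas as pd

def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Return a few simple data-quality indicators for a dataframe."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "null_share": df.isna().mean().round(3).to_dict(),  # per-column null ratio
    }

df = pd.DataFrame({"user_id": [1, 2, 2, 4], "amount": [10.5, None, 3.0, 7.2]})
report = quality_report(df, key="user_id")
print(report)

# A governance rule might refuse to promote the dataset if a check fails:
if report["duplicate_keys"] > 0:
    print("Gate failed: duplicate keys – route the batch back to wrangling")
```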

My colleague Mohamad Gharib, in turn, discussed what data quality (DQ) and, more specifically, data quality requirements are, why they really matter, and provided a very interesting perspective on how to define high-quality data, which would further serve as the basis for defining these requirements.

All in all, although we did not know each other before and had a very limited idea of what each of us would talk about, we all agreed that the seminar turned out to be very coherent – we and our talks, respectively, complemented each other, extending points previously touched upon but not thoroughly elaborated. This allowed us not only to make the seminar a success, but also to establish a very lively discussion (although the prevailing part of this discussion took place during the coffee break – as it usually happens – and so, unfortunately, is not available in the recordings, the link to which is available below).