I am hoping this is not "Yet Another Post about BI on Hadoop". Let's see what you think.
Before we delve into rethinking BI on Hadoop, I would like to ask the reader a question. What is the first thing that comes to mind when you think about Business Intelligence (BI)?
My gut feeling - based on experience and assuming the audience has some exposure to the term - is that a high percentage of people will have thought of DWH, SQL and RDBMS, followed by Tableau and QlikView. It just so happens that the standard description of Business Intelligence in Wikipedia does not mention any of these!
The point I am making is that we all hold biased views, shaped by what we hear, learn, adapt and forget. My hope is that the following sections of this article will get you to rethink what we know about BI and, more precisely, the scope for Hadoop/Big Data related technologies to provide business information, which I believe is the ultimate goal of BI.
Modelling your data to extract business information
There must be a good reason we keep thinking of DWH, SQL and RDBMS when considering BI. These technologies have been used for decades to represent business information in a concise and efficient way. When storage, memory and computation resources were costly, it was critical to normalise and index data efficiently, minimising costs by reducing duplicate data and avoiding computation cycles spent re-reading the same data or scanning details nobody needed.
Two innovations cemented data modelling as critical to providing business information: E. F. Codd's work at IBM, which introduced the third normal form in 1971 and shaped OLTP databases, and Dr Ralph Kimball's Dimensional Model for the DWH, presented in his first book, "The Data Warehouse Toolkit", back in 1996. I will follow up on Data Modelling in a future post, where I intend to introduce a number of innovative approaches like Data Vault.
Hadoop - The bulky kid on the block
However, BI, like everything else in this fast-paced technological world, is subject to change and evolution. Hadoop first appeared in 2006 as a platform to crunch vast amounts of Internet data in a simple but efficient way. It was based on Google's white papers on the Google File System and MapReduce, and developed in plain Java at Yahoo; the MapReduce programming model was a good divide-and-conquer approach to processing all that clickstream data.
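To make the divide-and-conquer idea concrete, here is a minimal single-process sketch of the MapReduce model in plain Python. It is not Hadoop code: the clickstream records, the function names and the `n_splits` parameter are all made up for illustration, and a real cluster would run the map and reduce steps on separate nodes.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, url).
CLICKS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/products"),
    ("u3", "/home"),
    ("u2", "/checkout"),
]


def map_phase(record):
    """Emit one (url, 1) pair per click."""
    _user_id, url = record
    yield (url, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values seen for one key into a single result."""
    return key, sum(values)


def run_job(records, n_splits=2):
    # Divide: cut the input into splits that could live on different nodes.
    splits = [records[i::n_splits] for i in range(n_splits)]
    # Map: each split is processed independently of the others.
    mapped = [pair for split in splits for rec in split for pair in map_phase(rec)]
    # Conquer: group by key, then reduce each group to one value.
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_views = run_job(CLICKS)
print(page_views)
```

The point of the model is that `map_phase` and `reduce_phase` know nothing about how the data is split, which is what lets the framework spread the work across as many machines as it has.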
Since then, Hadoop has evolved into a Big Data ecosystem of different features & languages:
- Combining Object-Oriented and Functional Programming with Java, Python and Scala, with interactive interpreters for rapid and iterative development;
- Supporting different data flows with Spark (Batch vs. Streaming) including Machine Learning and Graph Analysis;
- Advanced SQL query engines for Big Data;
- Combining data workloads for batch and streaming with Kafka, HDFS and Kudu.
In summary, Hadoop has become an ecosystem of technologies, with vendor solutions developed on top to cover different needs. Hadoop is not here to replace DWH or other BI solutions but to complement and enrich the way we provide business information.
The Hadoop way for BI
Now that, in theory, we have put aside our biased views and refreshed some fundamental principles about modelling and Hadoop, I would like to share the Hadoop BI use cases which, in my opinion and experience, are the most successful ones:
- DWH augmentation - Either replacing existing ETL workflows or implementing new ones on top of existing DWH implementations, bringing, among other things, the scalability to process big data and cheap archival/staging areas
- COTS solutions based on Hadoop - Highly specialised platforms for accessing and understanding Big Data. Some examples include:
- SQL on Hadoop engines, a.k.a. Massively Parallel Processing (MPP) engines, like:
We will go through some of these topics in future posts, but some core strengths of Hadoop are:
- Hadoop allows the enterprise to augment existing BI tools, bringing Advanced Analytics in the form of Scalable Machine Learning
- It aims to support Predictive and Prescriptive analytics, while the more Descriptive Analytics can remain in existing BI platforms
- It is THE environment to enable iterative exploratory analysis leveraging "Schema-on-Read" with full data access
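As a rough illustration of what "Schema-on-Read" means in the last point, the sketch below keeps raw events untouched and applies a different schema each time they are read. It is plain Python rather than any Hadoop API, and the sample events, field names and `read_with_schema` helper are assumptions made up for this example.

```python
import json

# Raw events stored as-is; note that the fields vary from record to record.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "country": "UK"}',
    '{"user": "u2", "amount": "5.00"}',
    '{"user": "u3", "amount": "12.50", "country": "ES", "coupon": "X1"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select the fields and cast their types."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two analyses, two schemas, one copy of the raw data.
revenue_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "country": str}

revenue_rows = list(read_with_schema(RAW_EVENTS, revenue_schema))
geo_rows = list(read_with_schema(RAW_EVENTS, geo_schema))

total_revenue = sum(row["amount"] for row in revenue_rows)
print(total_revenue)
print(geo_rows)
```

Nothing had to be modelled up front: a new question only needs a new schema, which is why this style suits exploratory work, while a schema-on-write DWH remains the better fit for stable, repeated reporting.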
You may wonder why I omitted terms such as "Data Lakes", "Data Hubs", "Snowflake schemas" and so on, common in BI terminology. I just tried to include the essential bits in this first post about rethinking BI on Hadoop. I will refer to more exotic terms once I delve into some of these areas in future posts.