This somewhat longer blog is a reaction to a discussion I had with a client’s tech people about their perceived demise of the star schema, as well as to Claudia Imhoff serving as a poster girl for Incorta’s “Death of the Star Schema” white paper and all the other marketing efforts launching their positioning term “Direct Data Mapping”. Except that when I ask their people what it exactly is, I get no new or unexpected answers.
It’s clear that Cloud computing has drastically changed the way data is stored, and especially the cost at which it is stored. I came across a video from Dave Fowler, CEO of Chartio, and I must admit it is one of the most intelligent strawman arguments against Ralph Kimball’s star schema I have ever seen. Allow me to analyse the video and Dave’s arguments step by step, using his PowerPoint slides as illustrations.
The Fallacy
Dave Fowler states that The Data Warehouse Toolkit was published long before the advent of Cloud computing, but he couldn’t remember when the first edition came out. I have the first edition, signed by the master, and I can help Dave with that: it dates from 1996.
Bill Inmon even wrote the foreword, something he didn’t do in the later books. It turned out the two went their separate ways, and in my experience the bottom-up, conformed dimensional data marts won every time over Inmon’s corporate data warehouse in the third normal form with functional data marts extracted from that behemoth. In most cases the behemoth never got finished, as the business users didn’t see any immediate deliverables coming out of the corporate information factory.
Please note that Ralph wrote quite a few more books than this one, as well as hundreds of articles, and some of them appeared when Cloud computing was already showing its potential.
As early as 1998, Ralph published the Lifecycle Toolkit, which was geared more towards project managers but already touched on the architectural aspects.
The Data Webhouse Toolkit, published in 2000, described data pipelines for processing semi-structured data as well as mining these data, a practice we all know and have applied ever since.
The ETL Toolkit, published in 2004, made it clear that the star schema is not just about technology: it is an architectural pattern. And that is exactly why the next screenshot of the video shows a strawman.
Yes, Dave, the first two (purely technical) arguments have been made obsolete by Cloud computing and by column stores. May I remind you that Ralph Kimball was one of the pioneers of this technology when he founded Red Brick Systems, the company behind Red Brick Warehouse, back in 1986?
When the understandability argument is dissected, I get the impression that Dave falls into circular reasoning. Wide tables are supposed to be easy for the business to understand, but they need some form of staging, which Dave describes as a “still largely normalised model”, and I quote: “Business users still find these operational models intuitive”.
I thought we introduced star schemas precisely because models in the third normal form are difficult for business users to understand? And now we’re introducing another, even more obsolete, approach to data warehousing, namely Bill Inmon’s corporate information factory 2.0? Really, Dave?
I thought we were past that station. And to get to the meat of it: the star schema pattern is embedded in a comprehensive architectural framework:
analyse the business processes;
create one transaction fact table per business process and make sure the dimensions are shared (conformed) for consistency;
if necessary, create an accumulating snapshot (does this compute well in a single wide table, or do you agree it takes an extra implementation step?);
provide flexibility across OLTP systems, functions, departments and all other ephemeral organisational aspects, using slowly changing dimensions to produce a data warehouse that can deliver data along any of these changing organisational axes (a minimal sketch of such a dimension follows below).
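To make that last point concrete, here is a minimal sketch of a type 2 slowly changing dimension in plain Python/pandas. The table and column names are illustrative assumptions of mine, not taken from the video or from Kimball’s books.

```python
# A minimal, hypothetical type 2 slowly changing dimension: history is kept by
# closing the old row and inserting a new one with its own surrogate key.
import pandas as pd

dim_customer = pd.DataFrame({
    "customer_key": [10, 11],                 # surrogate keys, one per version
    "customer_bk":  ["C-1001", "C-1001"],     # same business key, two versions
    "city":         ["Utrecht", "Amsterdam"],
    "valid_from":   pd.to_datetime(["2020-01-01", "2023-06-01"]),
    "valid_to":     pd.to_datetime(["2023-05-31", None]),  # NaT marks the current row
    "is_current":   [False, True],
})

fact_sales = pd.DataFrame({
    "customer_key": [10, 11],                 # facts point at the version valid at load time
    "order_date":   pd.to_datetime(["2021-03-15", "2024-02-01"]),
    "amount":       [100.0, 250.0],
})

# Sales reported "as they happened": each fact keeps the city that was valid at the time.
print(fact_sales.merge(dim_customer, on="customer_key")[["order_date", "city", "amount"]])
```

Each fact row points at the dimension version that was valid when the transaction happened, which is exactly how the framework keeps delivering consistent data across reorganisations.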
This slide says it all, especially the “VS” in there. In my world, there is no “versus”. Implementation on a technical level can be detached from the conceptual and logical model of a star schema. Let me elaborate on this later in the conclusions of my article.
But first, the flexibility argument
Let me take a look at each of these arguments.
“data will change”: yes, their format and the level of completeness with which they describe an object will change; but seriously, in whatever world we live in, objects like “Party”, “Party Role”, “Geolocation”, “Sales”, “Contact Method” etc. don’t change fundamentally. As I wrote elsewhere: processes are volatile, the data they produce are persistent.
“questions change”: yes, they do. If the questions are out of scope of the captured business processes, there is a reasonable case for a simpler architecture, a sort of quick-and-dirty draft set-up to provide an agile answer. I have experienced this in fast-growing start-ups: Libertel Netherlands in the nineties, a pharmaceutical start-up my colleague worked for in the early 2000s, and a fast-growing betting company in Germany I worked for a few years ago. There, concepts like Dan Linstedt’s Data Vault or a rough 3NF operational data store will do the job just fine. Once these business processes become routine and are better supported by operational systems, the star schema can take its position in the architecture, probably with the Data Vault remaining as an intermediate repository.
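For readers unfamiliar with the Data Vault style mentioned above, here is a minimal, hypothetical sketch of a hub and a satellite and of how a dimension can later be derived from them; the structures and names are my own simplification, not Dan Linstedt’s reference implementation.

```python
# A simplified Data Vault fragment for a Customer business key (illustrative names).
import pandas as pd

hub_customer = pd.DataFrame({
    "customer_hk": ["h1", "h2"],            # surrogate/hash key of the business key
    "customer_bk": ["C-1001", "C-1002"],    # business key from the source
    "record_source": ["crm", "crm"],
})

sat_customer_details = pd.DataFrame({
    "customer_hk": ["h1", "h1", "h2"],
    "load_ts": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-01-01"]),
    "name": ["Alice", "Alice", "Bob"],
    "city": ["Utrecht", "Amsterdam", "Berlin"],  # Alice moved: history kept in the satellite
})

# Deriving a simple (type 1) customer dimension downstream: latest satellite row per hub key.
latest = (sat_customer_details
          .sort_values("load_ts")
          .groupby("customer_hk", as_index=False)
          .last())
dim_customer = hub_customer.merge(latest, on="customer_hk")
print(dim_customer[["customer_bk", "name", "city"]])
```

The raw history stays in the satellite, so the star schema can be (re)built on top of it once the business process has stabilised.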
“Business change”: this is similar to the “questions change” argument. Most of the time, businesses change because the answers to questions from well-organised data warehouses tell them to. So if you rely on a poorly designed data warehouse, good luck with that.
“Sources change”: well, then you just add another one, don’t you? This is a technical problem that can easily be solved with data warehouse automation tools like TimeExtender, Qlik Compose and WhereScape. Data warehouse automation supports agility in the best possible way: a simple graphical UI produces data vaults or star schemas in a documented, incremental fashion.
“Best practices change”: this needs more clarification, Dave. If it is about best practices in IT project management, the various schools of thought are still fighting religious wars while their final output is roughly the same: an OLTP or OLAP deliverable. Levels of cost, quality and completeness of these deliverables vary independently of the practice applied: DevOps, Agile, Scrumban, V-Model, Waterfall… you name it.
“Technologies change”: that argument won’t age well, will it? You are promoting a column store as the ultimate technology and therefore you position all the upfront work in the light of that goal. I couldn’t care less which implementation technology is used for the star schema architecture and development method, from business analysis down to the logical schema. I have examined the possibilities of implementing it on Hadoop clusters, on key-value stores and on Spark data frames. They can all handle it, each with their pros and cons. If you don’t have a mental model of the data at hand, how can you guide the users and help them exploit the potential of the data? A data catalogue and a data science sandbox will certainly help, but not every stakeholder in the organisation will be proficient enough to exploit data without a model. “Omnis comparatio claudicat” (every comparison limps), but you don’t plan a trip by listing all the roads and street names you will need to arrive at your destination. Instead, you use a map.
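As an illustration of that technology independence, here is a minimal sketch of the same logical star join expressed on Spark data frames; the session setup and the table and column names are assumptions made up for this example.

```python
# A logical star schema implemented on Spark data frames: fact rows joined to a
# conformed dimension, then aggregated. The physical engine differs, the model does not.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

dim_customer = spark.createDataFrame(
    [(1, "Alice", "NL"), (2, "Bob", "DE")],
    ["customer_key", "customer_name", "country"],
)
fact_sales = spark.createDataFrame(
    [(1, "2024-01-05", 100.0), (2, "2024-01-06", 250.0), (1, "2024-01-07", 40.0)],
    ["customer_key", "order_date", "amount"],
)

(fact_sales
 .join(dim_customer, "customer_key")     # star join on the conformed dimension
 .groupBy("country")
 .agg(F.sum("amount").alias("total_sales"))
 .show())

spark.stop()
```

The logical model is identical to what a column store or a classical relational engine would execute; only the physical implementation differs.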
The data lake has increased the potential of analytics by an order of magnitude, but that doesn’t exclude the need for well-structured transaction data. I can show you customer cases where there is a round trip between unstructured data analysis and improvements to the structured data model, which in its turn… you get the idea. By the way, one of the functions of a data lake is segmenting the data according to their (presumed) frequency of use: expensive storage for “gold” data and cheaper tiers for “silver” and “bronze”. This too can become a pain when the data requirements change.
“KPI’s change”: yes, because questions change but the objects remain the same, so this is largely a synonym for “data changes”. Changing KPIs can also be the result of a business change, because new critical success factors come into the picture and new capabilities are required (cf. supra). That’s a no-brainer for any enterprise architect, and with your permission I count a data architect as an enterprise architect.
In conclusion
Now it’s time to address your issues with performance, Dave. In a serverless computing world we need to translate this into “performance cost”, right? Because compute power scales horizontally with increasing information demands. And then there is still the old materialised view approach, which remains a valid solution: you can also materialise those data marts into tables with partitioning and clustering, because, as you said yourself, storage is cheap and compute time is expensive.
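As a sketch of that materialisation option, assuming a BigQuery-style serverless warehouse (the datasets, tables and columns below are made up for illustration), the data mart can be rebuilt as a partitioned, clustered table by a scheduled batch job:

```python
# Hypothetical materialisation of a data mart as a partitioned, clustered table.
from google.cloud import bigquery

client = bigquery.Client()  # relies on default credentials/project being configured

ddl = """
CREATE OR REPLACE TABLE mart.sales_by_customer
PARTITION BY DATE(order_date)          -- assuming order_date is a TIMESTAMP
CLUSTER BY customer_key
AS
SELECT f.order_date, f.customer_key, d.country, SUM(f.amount) AS total_amount
FROM dwh.fact_sales AS f
JOIN dwh.dim_customer AS d ON d.customer_key = f.customer_key
GROUP BY f.order_date, f.customer_key, d.country
"""

# Storage for the extra table is cheap; the scheduled rebuild trades a little batch
# compute for fast, predictable queries on the partitioning and clustering columns.
client.query(ddl).result()
```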
I understand you need to promote a new way of working, together with a new tool that supports it. But you needn’t do so by arguing against a comprehensive architectural framework that integrates business processes, applications and data to produce flexible, future-proof solutions for the information demands of the business. Tools like DBT, Snowflake, Firebolt and the Incortas of this world can perfectly well interpret logical star schemas that are well understood by the business and fit them into column stores, a technology that has been around for thirty years.
I wish you all the best in your endeavours, but please, Dave, separate the technical implementation from business analysis and modelling down to a logical data architecture. I have seen it many times when new technologies were introduced: the clutter and buzzwords became so overwhelming that the business simply decided to do nothing and wait till the dust had settled. Meanwhile, the gorillas in the market are sharpening their knives to kill the newcomers, pre-empt their growth, or acquire the companies in their infancy and bury them in their product portfolio. And that is not what any tech start-up is looking for, is it?
Ralph, a serial tech entrepreneur, could tell you a few stories about technology adoption cycles. But don’t hesitate to give me a call or drop me a line; I also have a few stories to tell.
Epilogue
Please don’t frame me as a Kimball groupie. I have documented tough questions for the master, like these here. The modern data architecture discipline needs to reconcile proven technology with innovation. In this case, there is no conflict between the two. Let’s all benefit from that situation.
Post Scriptum:
There is a need for well-vetted transactional data, e.g. for financial, logistics, customer or competitive analysis and reporting, and these use cases can never be supported by instantiating wide tables on an ad hoc basis. You need a trusted batch process using trusted (master) data.
The combination of a data lake and a “classical” star schema based data warehouse brings the best of both worlds. But even in that environment, data scientists need time and insight to produce meaningful results. They will extract, prepare and pre-process data from the lake before they either formulate and test a hypothesis or let machine learning algorithms such as support vector machines or neural networks discover hidden relationships that may provide extra meaning…