When deciding on a table structure for analytical data, I find myself balancing the definition of the business with the usefulness of the table for reporting. This placed me in a number of complicated scenarios where merging definitions in a way that served reporting purposes became cumbersome and error prone.
It turns out that the balance between definition and design wasn't the problem; it was my architecture.
I was combining too many data points into one fact table. I would attempt to create one table that defined the entire life cycle of a sale, or include every factor of a customer membership lifespan that seemed prescient for analysis. Was it usable for analytics? For drilling down and finding correlations within trends? Sure—if you were highly skilled at understanding all of the data points and their relationships. But it made for difficult maintenance, reliance on developers, and in many cases, ETL processes that failed frequently.
Now I'll admit some of this was a product of my design ideas prior to understanding the concept of medallion architecture. The influence of being a data engineer who wanted to have fewer objects to deal with also made a difference. When I finally set these two things aside, I understood better that engineering was meant to serve architecture and not the other way around.
When it came to defining the business, I realized it was a much cleaner proposition. I thought to myself, "How does the business define itself in terms of its data? What are words, phrases, and structures that they use in their own language?" I realized that their definitions came down to two things, typically: a single atomic entity or event, such as a customer, a contract, or a sale, and the relationship between these data points. What if I let these ideas influence how I design analytical data?
A term I had not heard used before came to my mind: units of truth. Small tables whose existence was to define one of these individual items. If I wanted a fact table for sales, I would only include information and dimensions that related specifically to the event of the sale itself. No factors leading up to the sale. Nor any following events after the sale. For instance, if product delivery was another aspect of business, instead of putting "is delivered" in the sale fact table, I created a separate fact table called delivery. Sale dates and delivery dates would live in these separate tables, along with whatever information was relevant specifically to that event and nothing else.
These units of truth became my silver layer. From this I found that it was easy to serve up whatever sorts of reports the business was looking for. If an activity report was requested, tracking all of the different events that their teams may have completed the day before, all I had to do was group each of these units of truth by their respective event dates and combine the pieces together by the date. If instead the business was interested in a funnel style report, wanting to see the success of actions from lead to sale to delivery, all I needed to do was to connect the fact tables by their lead customer ID or by their order number—or whatever data point followed the opportunity to sell. Now, instead of constructing long series of complicated case statements within a query, I found myself simply counting rows and grouping by common dimensions across fact tables.
I'm still proving out this methodology, but even in its early stages it seems to have great promise. The benefit that I did not expect is that there are far fewer decisions needing to be made at every point of the process. Moving data through medallion architecture's layers has become quite methodical and almost boring—in a good way. All of the engineers on my team understand exactly what steps are to be taken as data moves through each layer without any sort of complex processing or business dependent decision making needing to be made other than the definitions themselves. The real effect? A lower stress work environment and clearer communication between business and technical teams.
I'll keep you posted on how it goes!