Higher level data model wanted!



Picture your average SaaS startup. It’s likely going to be a CRUD app in Rails/Django/Express backed by a transactional relational database like MySQL or PostgreSQL. Over time the engineers build up a sizable, normalized data model that closely approximates how users think about the problem the application is trying to solve. Users end up entering records with various attributes, and then querying and displaying those records in various ways.
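To make this concrete, here is a minimal sketch of the kind of normalized schema such an app might accumulate. The tables and columns (users, products, orders, order_items) are hypothetical stand-ins, not any particular app’s schema:

```python
import sqlite3

# Hypothetical, simplified version of the normalized model a CRUD app
# might accumulate: users place orders, orders contain items, items
# reference products.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    email      TEXT NOT NULL,
    created_at TEXT NOT NULL           -- signup timestamp
);
CREATE TABLE products (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    price REAL NOT NULL
);
CREATE TABLE orders (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    created_at TEXT NOT NULL
);
CREATE TABLE order_items (
    id         INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL
);
""")
```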

As you venture on this journey, questions about your product’s usage naturally arise: How many new users do we have this week? What is our churn? What are the most popular items people order?

Easy: a quick SQL query against the production DB takes the engineer five seconds, done. But then your company takes off, you add client operations and other non-technical folks, and the number of questions explodes. The questions also become harder to answer as the data model expands: more dimensions and metrics become relevant, and tables have to be joined. The engineers grow sick of answering them, so how do you stay a data-driven organization?
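For illustration, here is what that five-second answer might look like, assuming a users table with a created_at signup timestamp as in the hypothetical schema above (SQLite stands in for the production database):

```python
import sqlite3

# Stand-in for a read-only connection to the production DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)"
)

# "How many new users do we have this week?" -- the engineer's quick answer.
new_users_this_week = conn.execute(
    "SELECT COUNT(*) FROM users WHERE created_at >= date('now', '-7 days')"
).fetchone()[0]
print(new_users_this_week)
```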

How to scale?

At this point you might choose to hire a data analyst whose dedicated job is to answer all these questions, giving him access to a read-only replica of production. To answer them, he’ll have to be familiar with the transactional data model, as well as with the rapid changes it undergoes as the application continues to evolve. He’ll effectively have to duplicate the application’s data model for his own purposes, with considerable effort spent on working out how the tables in the database relate (unless the relationships are super obvious).
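To give a flavor of what that duplication means in practice, here is the kind of join-heavy query the analyst ends up writing by hand for a question like “what are the most popular items people order?”, again against the hypothetical schema sketched above:

```python
# Hand-written analytical query over the hypothetical transactional schema.
# Getting this right means knowing every foreign key and how the application
# happens to record timestamps, statuses, soft deletes, and so on.
MOST_POPULAR_PRODUCTS = """
SELECT p.name,
       SUM(oi.quantity) AS units_sold
FROM order_items oi
JOIN orders   o ON o.id = oi.order_id
JOIN products p ON p.id = oi.product_id
WHERE o.created_at >= date('now', '-30 days')
GROUP BY p.name
ORDER BY units_sold DESC
LIMIT 10;
"""
```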

The alternative to the data analyst (who will quickly become a bottleneck himself) is to build an ETL pipeline that takes data from the transactional database and shuttles it into a data warehouse with well-established models optimized for analytical queries, so that non-technical folks can self-serve. However, this approach comes with high upfront costs and an ETL pipeline to maintain, which can be overkill for organizations trying to stay lean.
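As a rough illustration of the idea (not a production pipeline), an ETL step boils down to pre-joining transactional rows into wide, denormalized tables that can be queried without knowing the source schema; the table names here are hypothetical:

```python
import sqlite3

# Stand-ins for the production replica and the data warehouse.
oltp = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

oltp.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
""")

# One wide row per order, pre-joined so analysts never touch the joins.
warehouse.execute("""
CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT,
                          user_email TEXT, user_since TEXT)
""")

# Extract + transform ...
rows = oltp.execute("""
    SELECT o.id, o.created_at, u.email, u.created_at
    FROM orders o JOIN users u ON u.id = o.user_id
""").fetchall()

# ... and load. In reality this runs incrementally and has to cope with
# schema changes and late data -- which is exactly the maintenance burden.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
warehouse.commit()
```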

“So about those Joins” by Metabase raises this seemingly inherent tension between what they call Pain-on-Read vs. Pain-on-Write, the fundamental opposition between a transactional query profile and an analytical one. The conclusion, it seems, is to admit defeat, accept the gradual need for an ETL pipeline, and use SQL views as a 'ghetto version' in the meantime. But what if we don’t have to settle?
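The 'ghetto version' is simple enough to sketch: views defined on the replica hide the joins from analysts, at the cost of still running analytical queries against a transactional engine (names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the read-only replica
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);

-- Analysts query the view as if it were one wide table.
CREATE VIEW analytics_orders AS
SELECT o.id         AS order_id,
       o.created_at AS order_date,
       u.email      AS user_email
FROM orders o
JOIN users  u ON u.id = o.user_id;
""")

print(conn.execute("SELECT COUNT(*) FROM analytics_orders").fetchone()[0])
```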

One data model to rule them all

Users, products, orders, items: whatever the concepts are, the engineers writing the application refer to the same ones as the analysts trying to answer questions for the business. That engineers prefer their data in normalized relational databases while analysts prefer columnar data warehouses is really just an implementation detail, there to support the performance characteristics of their respective access patterns to the same data model.

What if we didn’t think of the problem as translating between two foreign data models, but as accommodating differentiated access to the same data model? The same way specifying a data model in an ORM simplifies access for engineers, shouldn’t specifying a higher-level data model simplify access for analysts running analytical queries? To the best of my knowledge, Looker’s approach with LookML achieves just that. Once there, couldn’t you go one step further? Specify the data model once and simplify access for both engineers and analysts!
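To sketch what “specify once, access twice” could mean (this is purely hypothetical and not LookML’s actual syntax), imagine declaring entities, joins, dimensions, and measures once, and generating queries from that declaration:

```python
# Hypothetical higher-level data model specification: business concepts and
# their relationships are declared once; SQL is generated from it.
MODEL = {
    "orders": {
        "table": "orders",
        "joins": {"users": "orders.user_id = users.id"},
        "dimensions": {"order_date": "orders.created_at",
                       "user_email": "users.email"},
        "measures": {"order_count": "COUNT(orders.id)"},
    },
}

def generate_query(entity, dimensions, measures):
    """Turn a question phrased in business terms into SQL."""
    spec = MODEL[entity]
    select = [f"{spec['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{spec['measures'][m]} AS {m}" for m in measures]
    joins = " ".join(f"JOIN {t} ON {cond}" for t, cond in spec["joins"].items())
    group_by = ", ".join(spec["dimensions"][d] for d in dimensions)
    sql = f"SELECT {', '.join(select)} FROM {spec['table']} {joins}"
    if dimensions and measures:
        sql += f" GROUP BY {group_by}"
    return sql

# "Orders per user", without the analyst ever seeing the join condition:
print(generate_query("orders", ["user_email"], ["order_count"]))
```

A real specification would of course need much more (joining only the tables a query actually needs, permissions, caching, writes for the transactional side), but the point stands: the business concepts are declared once, and access for both audiences is generated from them.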

The dream

Not only would a unified data model specification help with efficiently establishing access within your organization; from a community perspective it would also open up completely new classes of approaches and applications.

Since it is inherently cross-platform, having multiple services access the same database becomes dead easy. Database vendors would only have to support query generation for the data model specification instead of providing libraries for all kinds of platforms. By abstracting the actual low-level queries away into higher-level business concepts, you could build applications that don’t have to reinvent the wheel when it comes to query generation and database access. Siri could understand your question, with its embedded entities, without being tied to a particular platform.

One might argue that SQL is exactly such a specification, one that has stood the test of time, and that argument is perfectly valid. However, it does appear to me that it is time for a higher-level abstraction, as the quality of access to data has arguably not improved much in a long time. (Though I hear Looker is a game changer.)

I have built a quick proof of concept with Watcher, but would love to hear your thoughts on the topic and how to move it forward.
