Making large-scale data systems usable with schema renaming
I wrote about some of my work at Kensho and published it a few months ago but hadn’t gotten around to linking it here until now. It’s on a technical topic but I’ve written it to be reasonably readable even if you don’t write code on a regular basis. Hope you enjoy!
When working with large, complex datasets, it’s vital to be able to efficiently find the data you need for a given task. Database schemas play a key role in this process by describing the structure of the data involved. However, schemas can often be hard to use for a variety of reasons: mysterious names, needing to join across many tables, or simply because they represent a complicated reality. As a result, it’s often hard to find what you need, especially if things are so complicated that it’s infeasible for you to manage all the complexity on your own.
We’ve previously discussed how using the compiler increases Kensho engineers’ productivity. Now we’ll dive into how schema renaming further improves on our existing tools by making it easy to navigate schemas without needing to modify the underlying databases.
The problem with schemas
When you first learn about databases, the schema acts as your north star. It tells you the structure of the data and how to query for what you want. The examples of schemas you encounter are simple, straightforward representations of information where every table and field name makes sense. If you want to know whose birthday it is today, you query for
Person and filter by
birthday. Everything fits neatly into a simple model of the world, as the focus is on how the databases themselves work and not how to manage large amounts of data. In other words, you don’t need to guess at how to query for the data you need– you can trust that data about companies is stored in a table called, well,
Company. In contrast, when you work with real-life data, especially large amounts of it, things tend to get complicated.
When working with large databases, it’s often necessary to normalize schemas to reduce data redundancy. By representing all data in exactly one place, it’s easier to maintain and extend the database’s functionality. Unfortunately, data normalization often requires splitting data up into separate tables, so queries require more joins than those written against non-normalized ones. As a result, querying a normalized database often requires using many joins– which gets even more complicated when you have hundreds of columns to choose from when writing a join.
Real-life data also tends to have a surprising amount of complexity, so querying for what you want seems to require a lot of lucky guessing because it won’t be obvious how to find what you’re looking for. You’ll often run into naming conventions with non-obvious meanings, even among those that do their best to straddle the gap between a complicated reality and the various constraints of databases themselves.
As a concrete example of the complexity behind real-life data, suppose you want to query for the name of a company. Upon checking out the
Company table, you find several fields like
CompanyNameAccentInsensitive and you aren’t sure which to use. As it turns out, the answer might well be “none of the above”! Companies change names frequently and the name that makes sense in the context of your query might be a join away in an adjacent table. It doesn’t matter what data storage you choose or how skilled a database administrator is– working with a database like this is always going to be complex because it reflects a reality which is inherently complex.
The options that currently exist are:
- Read documentation on every relevant field and table and join, which no longer works when things get sufficiently complicated because there are too many things to keep in mind at once.
- Find someone with more experience with this particular dataset and ask them to help write a query for you, which means you can’t write queries on your own.
What we’d really like to have is a more user-friendly schema without losing the benefits of data normalization, so that anyone at Kensho can be self-sufficient when writing database queries.
”Water, water everywhere, and not a drop to drink.” When you can access all the data but can’t query it.
How existing solutions fall short
One common strategy for working with datasets not suitable for a given use case is by first reorganizing the data via database migrations. These migrations often can be complicated, expensive, and hard to do right– but in our case, that’s the least of our worries. The bigger problem is that most of our databases are shared across hundreds if not thousands of products. Making sure a complex migration doesn’t break anyone else’s product or queries is a task that can easily require weeks if not months of cross-team coordination and planning.
Some databases support views, which can simplify the situation by showing only a subset of the data or making many tables appear as though they are joined together. There are multiple variants of views (ranging from simply redefining what data a user should be aware of, to materialized views, which actually requires making a copy of the underlying data).
Unfortunately, different databases often support different types of views (if at all!), so it’d be difficult to manage all the database-specific details for the wide variety of databases that might come into play. It’d also be infeasible to create a database-wide view encompassing all the data anyone might ever care about, while making the view any more usable than the underlying data itself, considering large, heterogeneous datasets we’re working with.
Synonyms, like the name suggests, let us rename certain parts of the database schema. This can help solve part of the user-friendliness problem. However, synonyms won’t necessarily make for user-friendly schemas. Like migrations and views, synonyms exist at the database level, which means any synonyms that we might define have to make sense for all possible use cases– but, as shown by the
CompanyName example discussed previously, sometimes an application might care about the various ways in which a company can be named and renamed whereas a different application might not. One size does not fit all here.
Our solution: application-specific schema renaming
Creating application-specific schemas
Our solution, which we’ll call schema renaming, allows for defining a schema specific to a particular application. Schema renaming is a set of tools implemented in the GraphQL compiler that allows anyone to make a GraphQL schema more user-friendly by renaming and modifying parts of the schema to be easier to understand.
In contrast to the database-based approaches described previously, schema renaming allows applications to define schemas specific to their use case. The key insight here is that different products depend on different parts of the data. This means the schema for a particular application need not exactly match what exists in the actual databases themselves– the application’s perception of the schema is specific to its needs, and does not affect any other application’s use of the same databases. Schema renaming allows us to have the best of both worlds!
To achieve this, we need an intermediary between the user and all the tooling that the compiler provides. The user perceives and writes queries against a user-friendly schema, while behind the scenes the queries automatically get converted into an equivalent query against the human-_un_friendly, database-provided schema. All of this would take place invisibly and silently, so the user doesn’t need to know anything about the database schema, only what they see in front of them. Benefits of this approach
Doing schema renaming at the application level also sidesteps many potential problems we might otherwise encounter.
Schema renaming is a lower-stakes operation than database migrations. It’s possible to preview what the renamed schema looks like before deciding whether to keep or discard the changes, and it’s easy to undo the renaming if the new schema becomes difficult to work with. As a result, it’s easier to iterate quickly on different ways of presenting data to applications.
In addition, all schema renaming changes can be isolated to a single application, eliminating the concerns about accidentally breaking other products or queries in the process. This means that, unlike with database migrations, it’s easier for individual teams to experiment with what works best for them without needing to coordinate database changes across many teams.
Finally, schema renaming only needs to deal with the single GraphQL schema that the application’s queries are written against. In contrast, trying to make schemas more friendly at the database level would require us to keep track of the extent to which a given database supports views, synonyms, or anything else that might be useful for every single type of database we intend to support– a tall order that only gets taller with every additional potential source of data! Application-level schema renaming makes it easier to create user-friendly schemas.