Denormalisation: Thoughtful Optimisation or Irrational Avant-Garde?



Taking Apart Relational Data – (Image by DALL-E)

This might come as a surprise to some people: data modelling is often a collaborative effort, involving lively debates with people from various fields. It's a typical breeding ground for whimsical ideas and clever tricks, making it a classic subject for debate about the best approach. To make the topic a little more dramatic, I like to think of it as a battle between the grey-bearded purist with a thick handbook of rules and the impetuous avant-gardist who simply throws fancy NoSQL (whatever that means) at everything.


Having a set of rules is helpful because it's hard to understand all the effects of decisions, especially when facing a new problem. Experience plays a crucial role in mastering any discipline, data modelling is no different. Established rules and conventions serve as an essential framework to bridge the knowledge gap. Yet, there is a nuanced balance between adhering to these guidelines and being open to experience and pragmatic reasoning.

"Know the rules well, so you can break them effectively." Dalai Lama's 18 Rules for Living

In my opinion, this adage really nails it. I value making educated decisions, and while reasoning with conventional rules is better than no reasoning at all, it shouldn't lead to rigid, unchangeable beliefs.

That said, I would like to devote this article to a classic topic in database design: the extent to which one should normalise a database and the reasons behind it. This topic, often covered in introductory database courses, may seem straightforward, but the formalism involved often sparks opinionated debates. I aim to provide a detailed look into what the terms normalisation and denormalisation actually mean, as a core point of confusion lies in how these terms are used and where they originate from.

I suspect that many readers with experience in databases have their own perspectives on this matter, so I advise taking my writing with a grain of salt. This article is not intended to be a tutorial or a definitive guide, but rather a sharing of my personal reflections.

A Bit of Backstory

I often find myself assigned to projects involving NoSQL databases and Apache Lucene-based technologies, such as Elasticsearch. This experience has made me comfortable with non-relational mental models. However, the significant use cases I encounter are typically limited to full-text search or time series data. Most other uses seem to be cases of trend hopping, which I must admit, I am guilty of as well.

In recent discussions about database design, I've noticed a tendency among NoSQL proponents to prematurely denormalise everything, while SQL purists often over-normalise their designs. This broad observation captures a spectrum of behaviours: from rigid adherence to relational conventions to complete disregard for them.

Interestingly, the more experience I gain, the less I consciously think about the formalities of normalisation. Yet, when I review my work, I realise that my designs often naturally adhere to normal forms. A relational view of data subtly shapes my mental model, even when I'm not explicitly considering it. Despite having worked with various database systems, I acknowledge that not every case requires strict safety constraints and a carefully normalised design.

Navigating between these opposing viewpoints can be exhausting, especially when design decisions involve both extremes. Nevertheless, it's important to recognise that both perspectives offer valid arguments.

What Exactly is Database Normalisation?

Let's kick things off by taking a close look at the term normalisation itself. As stated above, I am referring to normalisation with regard to relational data. A closely related usage of the term in Data Science is adjusting values to a common scale or range. Seen this way, the term captures a similar concept across different domains: both involve transforming data to achieve consistency and comparability across different contexts.

In the realm of databases, this concept was developed by Edgar Codd in his famous 1970 paper "A Relational Model of Data for Large Shared Data Banks", in which he lays the general foundation of relational databases. Normalisation is thus a core principle of relational databases and is described by the author on page 5 as:

[…] eliminating nonsimple domains appears worth investigating! There is, in fact, a very simple elimination procedure, which we shall call normalisation.

This definition aged well, but it is of course a child of its time. Codd's description of relational data derives from examining the problems of established systems that operated on tree or graph models, so normalisation was exercised by transforming these alternative models into relational ones. The quote above highlights the elimination of nonsimple domains. What Codd calls a relation and a domain are understood today as a table and a column, respectively. This translates the above into the process of reducing columns that contain non-atomic data to atomic data and converting nested tables into flat ones, which are then referenced by primary and foreign keys to retain the original meaning. This leads me to today's popularised concept based on this:

Normalisation is the process of (re-)structuring a Database into its normal form.

Isn't that an elegantly self-referential definition? It's almost like saying, "To normalise is to make normal" – not exactly the most enlightening explanation. However, it nicely bridges to the fact that normalisation simply means adhering to normal forms, which raises the question: what are normal forms? Defining them also clarifies what normalisation is. In order not to go beyond the scope of this article, I will refrain from using formalities in various places. I also want to highlight that I use the term relation a bit vaguely here. I am generally not referring to the mathematical definition, which is relevant for Codd's definition of the model, but to the everyday sense: the general concept of associations between data records, which is realised in relational databases by foreign keys and in graph models by edges.

First Normal Form (1NF)

The First Normal Form is defined by the requirement that columns contain only atomic values. For example, instead of having a single column for an address, each component of the address should be split into separate columns.
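
As a minimal sketch – table and column names are invented for illustration – the address example before and after applying the First Normal Form could look like this:

```sql
-- Not in 1NF: the address is stored as one non-atomic value.
CREATE TABLE customer_unnormalised (
    customer_id  INTEGER PRIMARY KEY,
    full_address TEXT  -- e.g. 'Baker Street 221b, London NW1 6XE'
);

-- In 1NF: each component of the address gets its own atomic column.
CREATE TABLE customer_1nf (
    customer_id INTEGER PRIMARY KEY,
    street      TEXT,
    house_no    TEXT,
    city        TEXT,
    postal_code TEXT
);
```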

In addition to this requirement, Codd introduced several constraints that complement this concept and are essential to the relational model:

  • Information should not be encoded in the order of columns or rows.
  • Domains should be consistent, meaning data types within a column must not be mixed.
  • Every table must have a primary key to uniquely identify each row.

These constraints are fundamental to the relational model and enable the use of powerful query languages like SQL.

Second Normal Form (2NF)

The Second Normal Form builds on the requirements of the First Normal Form and introduces additional constraints to further reduce dependency issues:

  • The table must already be in first normal form.
  • All non-key attributes must be fully functionally dependent on the primary key. This means that each non-key attribute should depend on the entire primary key, not just a part of it.

By reducing partial dependencies, issues related to data deletion are mitigated. In a table where certain columns depend on only a subset of the primary key, deleting a record by its primary key can result in the removal of data that is not dependent on that key.
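
A hypothetical sketch of such a partial dependency and its resolution (the order and product names are my own invention):

```sql
-- Not in 2NF: product_name depends only on product_id,
-- i.e. on part of the composite key (order_id, product_id).
CREATE TABLE order_item_unnormalised (
    order_id     INTEGER,
    product_id   INTEGER,
    product_name TEXT,      -- partial dependency
    quantity     INTEGER,
    PRIMARY KEY (order_id, product_id)
);

-- In 2NF: the partially dependent attribute moves to its own table,
-- so deleting an order item no longer destroys product data.
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT
);

CREATE TABLE order_item (
    order_id   INTEGER,
    product_id INTEGER REFERENCES product (product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
```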

Third Normal Form (3NF)

Carlo Zaniolo provided a refined definition of the Third Normal Form in his 1982 paper, "A New Normal Form for the Design of Relational Database Schemata". According to Zaniolo, a table is in third normal form if every non-key attribute is dependent only on the primary key and not on any other non-key attribute. This definition clarifies that the third normal form specifically addresses dependencies between non-key attributes, unlike the previous normal forms. To put it in simple rules:

  • Every non-key attribute must be dependent solely on the primary key.
  • Non-key attributes should not depend on other non-key attributes.

A closely related form is the Boyce-Codd Normal Form (BCNF), which extends this constraint to all attributes, not just non-key ones. It's important to note that there are several iterations and evolutionary steps beyond the initial forms defined by Codd.
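
To make the dependency between non-key attributes tangible, here is a small sketch with invented table names, showing a design that violates the Third Normal Form and its decomposition:

```sql
-- Not in 3NF: department_name depends on department_id,
-- a non-key attribute, rather than on the primary key student_id.
CREATE TABLE student_unnormalised (
    student_id      INTEGER PRIMARY KEY,
    student_name    TEXT,
    department_id   INTEGER,
    department_name TEXT    -- transitive dependency
);

-- In 3NF: the dependent attribute moves into its own table.
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT
);

CREATE TABLE student (
    student_id    INTEGER PRIMARY KEY,
    student_name  TEXT,
    department_id INTEGER REFERENCES department (department_id)
);
```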

Fourth and Fifth Normal Forms (4NF & 5NF)

Ronald Fagin described the Fourth and Fifth Normal Forms in 1977 and 1979, respectively. These forms introduce more complexity and can be harder to grasp. Many resources use overly artificial examples, making these concepts hard to remember. Here is a simple rule of thumb to understand them, although this simplification leaves out some details.

To bring a table into Fourth Normal Form, tables whose columns hold multiple, mutually independent values for a single key should be split. For example, a product with multiple colours and sizes should be divided into separate tables to avoid redundancy and invalid combinations. This prevents issues such as a newly introduced size accidentally creating an invalid colour entry if the column is mandatory.

A table is in Fifth Normal Form if it cannot be broken down into smaller tables without losing any information, and combining the smaller tables back together reconstructs the original table exactly as it was. Using the product colours and sizes example, joining both tables should still be possible without losing any information or introducing inconsistencies.
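
Sticking with the colours-and-sizes example, here is a hedged sketch (table names assumed) of the split required by the Fourth Normal Form, followed by the join that, in the spirit of the Fifth Normal Form, reconstructs the valid combinations:

```sql
-- Not in 4NF: colours and sizes are independent of each other,
-- yet every combination has to be stored per product.
CREATE TABLE product_variant (
    product_id INTEGER,
    colour     TEXT,
    size       TEXT,
    PRIMARY KEY (product_id, colour, size)
);

-- In 4NF: each independent multivalued fact gets its own table.
CREATE TABLE product_colour (
    product_id INTEGER,
    colour     TEXT,
    PRIMARY KEY (product_id, colour)
);

CREATE TABLE product_size (
    product_id INTEGER,
    size       TEXT,
    PRIMARY KEY (product_id, size)
);

-- 5NF intuition: joining the decomposed tables reconstructs
-- the colour/size combinations without losing information.
SELECT c.product_id, c.colour, s.size
FROM product_colour c
JOIN product_size s ON s.product_id = c.product_id;
```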

Misleading Terminology

After this attempt at breaking down normal forms, I realise that the term normalisation might be somewhat misleading. Originally, Codd presented it as a method to transform other existing data models into relational ones. Nowadays, however, normalisation is often introduced as a tool or optimisation step when working with relational systems, which I believe undermines its true significance. Practically speaking, normalisation is the process of decomposing large tables into smaller, less redundant ones.

What's the Purpose of All This?

Codd was motivated to develop a new way to manage data due to the limitations of existing hierarchical and network-based database models in the 1970s. These models were complex, difficult to modify, bound to physical storage, and prone to undesirable characteristics. Codd's relational model introduced a more straightforward, mathematically grounded approach that was disconnected from the constraints of physical storage.

In today's terminology, the unwanted behaviours that undermine data quality are described by three terms:

  1. Insertion Anomaly: This occurs when certain attributes cannot be inserted into the database without the presence of other attributes. For example, if you cannot add a new student to a database without assigning them to a course, it creates an insertion anomaly.
  2. Deletion Anomaly: This happens when the deletion of certain data inadvertently causes the loss of additional, unintended data. For instance, deleting a course from the database might also remove all associated student records if they are stored in the same table.
  3. Update Anomaly: This arises when one piece of data requires multiple updates in different places. For example, if a student's address is stored in multiple tables, changing their address requires updating each instance.

While these anomalies accurately describe the behaviours encountered with non-normalised data, their combined effect is a dataset that is inconsistent or lacks integrity. An update anomaly, for example, can result in inconsistent data, such as when a student's address can no longer be determined because multiple different addresses appear in the records.
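
To make the three anomalies concrete, here is a sketch of a single wide table – names invented – that suffers from all of them:

```sql
-- One wide table mixing student and course data.
CREATE TABLE enrolment (
    student_id      INTEGER,
    student_name    TEXT,
    student_address TEXT,
    course_id       INTEGER,
    course_title    TEXT,
    PRIMARY KEY (student_id, course_id)
);

-- Insertion anomaly: a student cannot be stored without a course,
-- because course_id is part of the primary key.

-- Deletion anomaly: removing the last enrolment of a course
-- also removes every trace of that course.
DELETE FROM enrolment WHERE course_id = 42;

-- Update anomaly: the address is repeated per enrolment and has to
-- be updated everywhere to stay consistent.
UPDATE enrolment SET student_address = 'New Street 1' WHERE student_id = 7;
```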

Additionally, a non-normalised design can lead to data integrity issues beyond inconsistencies, such as mixed data types within a single column, making it difficult to enforce constraints and prevent the insertion of invalid data. Redundant data, which is implicitly removed when normalising a design, also plays a significant role in large systems, where redundancy can cause unnecessary storage overhead.

For the narrative of this article, I want to define data quality by these aspects. While the term is usually used in a much broader sense, I'll restrict the definition to these cases.

Breaking The Rules: Denormalisation

After all, the inception of NoSQL databases highlighted a clear demand for designs that are not strictly normalised. That's essentially what denormalisation is: the active process of breaking the established rules of normalisation. I emphasise the word 'active' because it involves intentionally altering normalised data, not making preemptive design decisions. There is a subtle difference between actively adjusting designs and making anticipatory choices during the design phase. While I acknowledge that there are many valid cases for preemptive decisions against a normalised design, my focus here is on the active process of denormalising as opposed to simply not normalising.

Practically, however, there is little distinction; people often refer to denormalised design as anything that is not normalised, regardless of how it was produced. Nowadays, the term is frequently used in contexts where relational designs are transformed into NoSQL designs. For example, in Cassandra, this involves more delicate nuances like designing around partition keys.

The Likes of Justin Bieber

A very interesting case study of geographical scaling can be found in a post on the Instagram engineering blog on Medium. It's mostly a typical "how big business manages massive scale" piece, but it provides a great outline of the most popular technique associated with denormalisation: the elimination of slow joins.

Around 2015, Instagram started facing significant problems when massively popular users, like Justin Bieber, posted content. Their servers would often get overloaded, leading to timeouts and user complaints. To address this, they optimised their model of likes on posts, after other optimisation possibilities in other layers had already been exhausted. They most likely started with simple SELECT COUNT(*) statements (that's just a guess on my side) and scaled by adding caching. When this was no longer enough, they introduced a count column and incremented it on every new like. This adds a tiny bit more work to update operations but boosts the performance of read operations: in the end, getting the likes is a simple select on a single record and column. This solved the timeouts and complaints that occurred whenever Justin Bieber posted.
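
The mechanics are easy to sketch in plain SQL. The following is my own simplified illustration of the idea, not Instagram's actual implementation; all table and column names are assumptions:

```sql
CREATE TABLE post (
    post_id    BIGINT PRIMARY KEY,
    author_id  BIGINT NOT NULL,
    like_count BIGINT NOT NULL DEFAULT 0  -- denormalised counter
);

CREATE TABLE post_like (
    post_id BIGINT REFERENCES post (post_id),
    user_id BIGINT NOT NULL,
    PRIMARY KEY (post_id, user_id)
);

INSERT INTO post (post_id, author_id) VALUES (1234, 42);

-- Writing a like now touches two places; they have to stay in sync.
BEGIN;
INSERT INTO post_like (post_id, user_id) VALUES (1234, 5678);
UPDATE post SET like_count = like_count + 1 WHERE post_id = 1234;
COMMIT;

-- Reading becomes a cheap single-row lookup instead of a COUNT(*).
SELECT like_count FROM post WHERE post_id = 1234;
```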

This brief description encapsulates the essence of denormalisation: the intentional introduction of redundant information to enhance read performance. It also becomes clear that denormalisation is a double-edged sword. While it can massively improve read performance, it can also degrade update performance and increase storage cost due to redundancy. In this case the overhead seems trivial, but the performance loss can be considerable in other cases. This should be a key takeaway: denormalisation does not necessarily make your database faster overall.

Secondly, it's clear that denormalisation was not the immediate tool to use. It was implemented after caching and other optimisation tricks were already in place. The takeaway here is that denormalisation is never without trade-offs and is most likely not the best first option.

Lastly, it's rather easy to see that adding a counter column introduces an easy way to create inconsistency, or more precisely, to cause update anomalies. When new likes occur, not only do the likes themselves have to be created, but the counter also has to be updated. This introduces the risk that the likes and the counter get out of sync.

Beyond Etymology

The previous example of adding a counter column to a table is significant, although technically it's not denormalisation, despite the blog post's claim. Adding a counter column for pre-aggregated values about related tables does not necessarily violate any normal form. Even with the introduction of a counter column, all columns still depend solely on the primary key, assuming the design was already normalised. However, it introduces potential inconsistency into the design: if entities are created and the counter is not updated, the counter no longer accurately reflects the count of its corresponding entities. This leads us to what is commonly understood as denormalisation: the deliberate introduction of redundant data to enhance read performance, essentially trading storage space for (somewhat) faster query times. With this definition, there are various practical techniques.

Summary Tables

Summary tables are a technique used to store aggregated data for reporting or generally quick access. For example, in a financial system tracking transactional payments, summary tables can be created to monitor payments across various dimensions such as customer, payment method, month, quarter, and region. Instead of frequently running complex queries on raw transaction data, summary tables provide precomputed totals, thereby improving performance.

A common tool offered by most relational databases for this purpose is the Materialised View, which is often used synonymously with summary tables.
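
A hedged sketch of such a summary table as a materialised view, using PostgreSQL syntax and invented table names:

```sql
CREATE TABLE payment (
    payment_id     BIGINT PRIMARY KEY,
    customer_id    BIGINT NOT NULL,
    payment_method TEXT   NOT NULL,
    amount         NUMERIC(12, 2) NOT NULL,
    paid_at        TIMESTAMPTZ NOT NULL
);

-- Precomputed totals per customer, month and payment method.
CREATE MATERIALIZED VIEW payment_summary AS
SELECT customer_id,
       date_trunc('month', paid_at) AS month,
       payment_method,
       SUM(amount) AS total_amount,
       COUNT(*)    AS payment_count
FROM payment
GROUP BY customer_id, date_trunc('month', paid_at), payment_method;

-- The precomputed results have to be refreshed explicitly.
REFRESH MATERIALIZED VIEW payment_summary;
```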

Eliminate Joins

Another technique under this umbrella is the elimination of joins by storing frequently accessed data directly within the main table. For example, instead of using a separate table for payment statuses, the payment status can be stored directly in the transaction table. This reduces the need for join operations, thereby speeding up query execution.
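
As a sketch with assumed names, folding the status text into the transaction table looks like this:

```sql
-- Normalised: the status is a reference to a lookup table.
CREATE TABLE payment_status (
    status_id   INTEGER PRIMARY KEY,
    status_name TEXT NOT NULL
);

CREATE TABLE payment_transaction (
    transaction_id BIGINT PRIMARY KEY,
    amount         NUMERIC(12, 2) NOT NULL,
    status_id      INTEGER REFERENCES payment_status (status_id)
);

-- Reading the status text requires a join.
SELECT t.transaction_id, t.amount, s.status_name
FROM payment_transaction t
JOIN payment_status s ON s.status_id = t.status_id;

-- Denormalised: the status text is stored redundantly on the
-- transaction, so the same question becomes a single-table query.
ALTER TABLE payment_transaction ADD COLUMN status_name TEXT;

SELECT transaction_id, amount, status_name
FROM payment_transaction;
```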

Performance is not the only reason for this approach. I often encounter this argument when speaking with business analysts who aim to simplify their query complexity to increase their productivity and speed.

External Data

When integrating with external systems that use non-normalised schemas, you can store the data in compact columns, such as JSON, instead of normalising it. This simplifies data retrieval and reduces the complexity of the database schema, aligning it more closely with the external system's design. However, unlike the previous techniques, this approach is not primarily about performance gains or storage savings.

I frequently use this technique for archiving, similar to soft delete approaches, by storing normalised data in JSON columns for compliance or restoration purposes. This is akin to the concept of hot, warm, and cold data archival, where accessibility may be reduced but remains acceptable.
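
A minimal sketch of this approach, using PostgreSQL's jsonb type and invented names, keeps an external payload as-is next to the regular relational data:

```sql
-- The external system's payload is stored as-is instead of being normalised.
CREATE TABLE external_order (
    order_id    BIGINT PRIMARY KEY,
    received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload     JSONB NOT NULL
);

-- Individual fields can still be queried when needed.
SELECT order_id, payload ->> 'customer_email' AS customer_email
FROM external_order
WHERE payload -> 'shipping' ->> 'country' = 'DE';
```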

When Data Quality Takes a Hit

A trade-off when denormalising is the reduced quality of data. This was particularly evident for me in trend-hopping scenarios where natural relational data concepts were forced into non-relational systems like MongoDB during its early days. Without transactional guarantees, I often had to write extensive code to handle various scenarios, including manual joins. Updating values in new attributes added to existing records also required extra code.

Anyone working with data in such contexts understands how poor quality can turn simple tasks into prolonged and complex challenges. I regularly encountered the following scenarios:

  • Inconsistent Data: I often faced the challenge of dealing with inconsistent or duplicated data. This not only created confusion but also hindered decision-making processes, whether for business or technical purposes.
  • Increased Maintenance: I spent countless hours cleaning and validating data, diverting my attention from more productive tasks. This tedious and monotonous work becomes nearly impossible to manage correctly without proper documentation, especially when addressing anomalies.
  • Difficult Downstream Tasks: Poor data quality made downstream tasks significantly more labour-intensive. For example, error handling in software development, already a mammoth task, required much more effort because fewer assumptions could be made about the data. Similarly, producing plausible outcomes in business reporting demanded extensive extra work to ensure reliability, as the data's inconsistencies skewed insights.

My Rule of Thumb

If I'm uncertain about the future scale of my data, I normalise as much as possible, with the understanding that denormalisation can be applied later if needed. The practical perspective is that by the time performance issues necessitate denormalisation, I should have the resources – time, money, and expertise – to address them. If not, I might be facing bigger problems.

For large datasets where denormalisation seems necessary, such as for aggregation purposes, I ensure all other optimisations at different layers are exhausted first. There might be ways to improve performance without any trade-offs.

When deciding to denormalise, I back my decisions with concrete metrics and benchmarks. Without solid evidence, premature optimisations can create more problems than they solve. It's crucial to ensure that the benefits of denormalisation are clear and justified.

Modern relational databases recognise the need for denormalisation. For instance, PostgreSQL's support for JSON types allows me to store lists of values and still query them efficiently. This feature can reduce the trade-offs involved in denormalising parts of my design and should be considered when deciding on denormalisation.
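
For example, a list of tags can live in a jsonb column and still be filtered efficiently; here is a small sketch under assumed names:

```sql
-- A list of values stored directly on the row instead of in a join table.
CREATE TABLE article (
    article_id BIGINT PRIMARY KEY,
    title      TEXT NOT NULL,
    tags       JSONB NOT NULL DEFAULT '[]'
);

-- A GIN index makes containment queries on the list efficient.
CREATE INDEX article_tags_idx ON article USING GIN (tags);

-- Find all articles tagged with 'database-design'.
SELECT article_id, title
FROM article
WHERE tags @> '["database-design"]';
```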

Final Thoughts

It's strange to call normalisation a convention – it is rather the core foundation of how to arrive at a relational design. In discussions or articles, the topic is often presented as a set of tools, but it is actually a core foundation of the whole idea. It is not enforceable by the database system because normalisation requires domain knowledge and an understanding of data semantics, which the system cannot provide. Therefore, it is the designer's responsibility to follow these principles intentionally and not break them unknowingly.

To answer the question raised in the title of this article: yes, I believe that denormalisation often devolves into an irrational attitude, especially as it trickles down from big tech into smaller teams and becomes a matter of premature optimisation. However, denormalisation is as natural and inherent to the relational model as normalisation. Both are deliberate efforts based on domain knowledge and its interpretation, which makes them somewhat subjective. The lesson here is to be conscious of why these techniques are applied and to be aware of their consequences, benefits and costs.

As is often the case in this discipline, we tend to run in circles because every decision involves compromises.


Did I get something wrong, or do you think I missed the point here? Get in touch with me on Mastodon or x.com.

Further Readings & Resources

A relational model of data for large shared data banks | Communications of the ACM

A new normal form for the design of relational database schemata | ACM Transactions on Database…

Multivalued dependencies and a new normal form for relational databases | ACM Transactions on…

Simple conditions for guaranteeing higher normal forms in relational databases | ACM Transactions…

Instagration Pt. 2: Scaling our infrastructure to multiple data centers

"One Size Fits All" | Proceedings of the 21st International Conference on Data Engineering


How Instagram Solved Its Justin Bieber Problem

