Introduction

Even a good database design cannot always protect against bad data. However, there are many cases where a well-thought-out database design can protect us against major problems with bad data - such as when the data tells us something we know cannot possibly be logically true.

Examples of bad data can range from the simple to the complex (via the insidious!). Consider a customer table where a particular customer has two dates of birth. Such an example might be referred to as a “failure of data integrity”, and as such the data in the entire table can no longer be trusted because it disagrees with itself.

When data disagrees with itself, as in the example above, it reflects more than simply a problem of bad data. It indicates we have a problem in the very design of the database (table) itself.

And thus to the focus of this article: bad data often exposes bad database design, where the database has not been normalized - or perhaps has not been normalized to a sufficient level.

Why Normalize?

When you normalize a database table you are, at a high level, structuring it in such a way that it cannot express redundant information. Furthermore you, in some sense, simplify what the table is telling you - thereby making it easier to read and use.

Normalized tables are thus:

  • Protected from contradictory data.
  • Easier to understand.
  • Easier to enhance and extend.

More fundamentally, by sufficiently normalizing a database table you protect yourself (and your data) from three common anomalies that often appear:

  • Insertion anomalies.
  • Update anomalies.
  • Deletion anomalies.

What is Normalization?

There are sets of criteria that we can use to assess our “exposure” to bad database design - “exposure” in the sense that bad data would be far more likely to invalidate the entire table. These criteria are referred to as the “normal forms”, such as the first normal form, second normal form, third normal form, and so on.

These normal forms can be thought of by analogy to the safety assessments one might apply to any other engineering endeavour. For example, in bridge design, if the bridge were to pass (only) a level 1 safety assessment the engineer would know for sure that it was safe for pedestrian traffic. But anything heavier, such as a car, would pose a risk to the structural integrity of the bridge. If the bridge were then to undergo, and pass, a level 2 safety assessment the engineer would be confident that small car traffic could use the roadway. And again, anything heavier, like a lorry or large van, would reintroduce a safety risk. And so in turn you can imagine applying consecutively higher levels of safety assessment that progressively allow for heavier or more frequent traffic.

The normal forms of database theory work in a similar way. If a database table meets the requirements of first normal form (1NF) we know it meets the bare minimum data integrity safety guarantee. But if we knew it also met second normal form (2NF) then we’d know it also met those stricter safety guarantees. And so it would continue through the higher normal forms.

First Normal Form (1NF)

Suppose you were to ask “what are the 7 colours of a rainbow?”. One person might answer yellow, blue, red, orange, green, violet, and indigo. Meanwhile someone else might answer green, indigo, blue, orange, red, yellow, and violet. To those asking and answering the question, both answers are correct … and both answers are equivalent. After all, it is only the order in which the answers are given that differs. Likewise, with respect to a relational database, the same principle applies; we get the answer back to us in an arbitrary order.

SELECT Colour_Name
FROM Colours_Of_Rainbow;

However, now suppose we were to ask: what is the mean wavelength (in nanometers) for each colour of a rainbow - from longest to shortest? Clearly here it isn’t simply the colour names that convey meaning - the order in which they are returned to us carries a crucial meaning as well. And so although the first example, with just the colour names, was perfectly understandable to a human, it was not normalized as far as a relational database is concerned. Since there is no inherent concept of row order within a database table - the rows form an unordered set - the table, regardless of the SQL used, is unable to provide the correct answer.

First 1NF Violation

And so we have our first violation of 1NF - when we use row order to convey information in our database table we are violating first normal form.

For the above examples the solution is simply to be explicit in our design. Rather than just listing out the colour names - i.e. having a table with only one column, for colour names - we should also devote a second column to each colour’s mean wavelength. The wavelength values then provide a natural ordering we can use in subsequent queries.
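A sketch of this design might look like the following (the wavelength column name, Mean_Wavelength_nm, is an assumption for illustration):

CREATE TABLE Colours_Of_Rainbow (
  Colour_Name        VARCHAR(20)   NOT NULL,
  Mean_Wavelength_nm DECIMAL(5, 1) NOT NULL
);

-- The ordering is now expressed explicitly in the query,
-- rather than implied by row order:
SELECT Colour_Name
FROM Colours_Of_Rainbow
ORDER BY Mean_Wavelength_nm DESC;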

Second 1NF Violation

The second way of violating 1NF occurs if we mix data types within a column. For example, suppose that in the second column from the example above we expressed some mean wavelengths as floating point numbers while expressing others as strings. Although a spreadsheet system (like Microsoft Excel) might allow this, a relational database would not allow such a freedom: each column must hold a single, declared data type.

And so to our second violation of 1NF - mixing data types within the same column violates first normal form.
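It is the explicit type declaration on the column that enforces this rule - a sketch, again assuming a numeric wavelength column named Mean_Wavelength_nm:

-- Mean_Wavelength_nm is declared as DECIMAL, so only numbers are accepted:
INSERT INTO Colours_Of_Rainbow (Colour_Name, Mean_Wavelength_nm)
VALUES ('red', 685.0);           -- accepted: a number
-- VALUES ('red', 'long-ish');   -- rejected: not a number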

Third 1NF Violation

The third way of violating 1NF involves designing a database table without using a primary key. A primary key is simply a column (or combination of columns) that uniquely identifies a particular row in a table. For example, in the rainbow colours question above we would need to know, unequivocally, that each row tells us the name of a specific colour of the rainbow (and its mean wavelength) and nothing else. Thus it would make sense in this example to designate the name of the colour as the primary key, with the second column, containing the mean wavelength, left as an ordinary (non-key) attribute.

ALTER TABLE Colours_Of_Rainbow
ADD PRIMARY KEY (Colour_Name);

With this primary key in place the RDBMS would prevent us from having rows where the colour name duplicated that of some other row. And this would be to our advantage, as it would prevent us from having two different mean wavelengths for the same colour.

Hence, every table should have a primary key - if it doesn’t, then it violates first normal form.
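With the key in place, an attempt to insert a duplicate colour would be rejected - a sketch, assuming the wavelength column is named Mean_Wavelength_nm:

INSERT INTO Colours_Of_Rainbow (Colour_Name, Mean_Wavelength_nm)
VALUES ('red', 685.0);

-- A second row for 'red' violates the primary key and is rejected:
INSERT INTO Colours_Of_Rainbow (Colour_Name, Mean_Wavelength_nm)
VALUES ('red', 650.0);  -- fails with a primary key violation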

Fourth 1NF Violation

The final way of violating 1NF involves repeating groups. For example, each entry in the Inventory column of the following example table is a repeating group - a variable-length list of items squeezed into a single field:

Player_ID | Inventory
tstark    | 2 lasers, 5 missiles
srogers   | 15 powerballs
nromanoff | 4 shields, 6 staff, 10 powerballs, 8 missiles

In the above example table, it isn’t hard to imagine there could be many other types of items that a player could have and likewise many other different combinations of these items that may be present in the player’s inventory.

So now, you could design a table that stored the inventory as a single string of text - but clearly this would become overwhelming very quickly, and prohibitively cumbersome to query.

Alternatively, you could have individual columns for each type of inventory item (one for lasers, one for missiles, etc.) - and indeed individual columns for the tally count of each of those items. However, given that (for example) there could be many hundreds of possible inventory items - or worse still, there may be no limit to the number of possible inventory items at all - this would not be a particularly practical solution either.

And so, the presence of these “repeating groups” of data items, on a single row, would violate first normal form.

A better solution would be to have individual rows for each item type that a player has in their inventory.

Player_ID | Item_Type  | Item_Quantity
tstark    | lasers     | 2
tstark    | missiles   | 5
srogers   | powerballs | 15
nromanoff | shields    | 4
nromanoff | staff      | 6
nromanoff | powerballs | 10
nromanoff | missiles   | 8

Here we allow for the many hundreds of possible inventory items (and combinations of items) by extending the number of rows in the database table. The advantage is that our queries of the table stay the same, regardless of how many rows there are. The complexity of any query arises from the table’s columnar structure (the table design), such that the number of rows of data becomes almost irrelevant. And you can imagine here we would designate the combination of the columns Player_ID and Item_Type as the primary key.
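A sketch of this design (the table name Player_Inventory and the column sizes are assumptions):

CREATE TABLE Player_Inventory (
  Player_ID     VARCHAR(20) NOT NULL,
  Item_Type     VARCHAR(30) NOT NULL,
  Item_Quantity INT         NOT NULL,
  -- the composite primary key: one row per player per item type
  PRIMARY KEY (Player_ID, Item_Type)
);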

Rules: First Normal Form

The following would violate 1NF and are thus not permitted:

  • Using row order to convey information.
  • Mixing data types within the same column.
  • Having a table without a primary key.
  • Allowing repeating groups.

Second Normal Form (2NF)

Suppose we wished to enhance the previous example by including a player rating.

Player_ID | Item_Type  | Item_Quantity | Player_Rating
tstark    | lasers     | 2             | Intermediate
tstark    | missiles   | 5             | Intermediate
srogers   | powerballs | 15            | Beginner
nromanoff | shields    | 4             | Advanced
nromanoff | staff      | 6             | Advanced
nromanoff | powerballs | 10            | Advanced
nromanoff | missiles   | 8             | Advanced

You’ll notice that we now have repeated information: because our solution to the repeating groups problem introduced a row for each inventory item, the Player_ID and Player_Rating information is necessarily repeated over each of a player’s rows.

Here then we note that it is not good design practice to repeat information in such a way. To demonstrate, suppose Player_ID srogers were to lose the 15 powerball items in their inventory. We would have to remove that row from the table. But this would also require us to remove the player rating. Thus we wouldn’t be able to query the player rating at all. It would be as if the player srogers had entirely disappeared.

In fact this is known as a “deletion anomaly” - where the database would no longer know anything about a particular entity because it was deleted when some associated data was deleted.

Alternatively, suppose player tstark changed their rating from Intermediate to Advanced. To now capture this new rating in the above table we would have to update two records (once for the row with lasers, and again for the row with missiles). But suppose that during this process, some external network error affected our connection to the database such that as a consequence only one of those records was updated. We would now have contradictory information in the table - one row saying tstark had a Player_Rating of Intermediate and the other row saying it was Advanced. Such a logical inconsistency is known as an “update anomaly”.

Finally, suppose a new player was added. Since this player is inevitably going to start with no items at all in their inventory, there would be nothing for us to place in either the Item_Type or Item_Quantity columns. If the notion of a Null field wasn’t recognised then a record (row) for this new player couldn’t exist, and thus there would also be no record of their rating. It would be as if this new player didn’t actually exist - a so-called “insertion anomaly”.

These deletion, update, and insertion anomalies show that the above example table violates second normal form. 2NF is about how a table’s non-primary-key columns (the non-key attributes) relate to the primary key. In the table above, these non-key attributes are the Item_Quantity and Player_Rating columns, and we should consider how each of them relates to the primary key (Player_ID and Item_Type).

  • Attribute Item_Quantity does depend on the entire primary key.
  • There is a functional dependency between the primary key and the attribute (Item_Quantity) - each value of the primary key is associated with exactly one value of Item_Quantity.

However, the same is not true for Player_Rating. This is a property of the player only, i.e. of Player_ID. This dependency is the problem - because Player_ID alone is not the (whole) primary key.

So, the issue is that we added an attribute (Player_Rating) to a table where it didn’t really belong. The fact that Player_Rating is a property of Player_ID (alone) tells us that a player is an important concept in its own right - and thus deserves its own table.

Player_ID | Player_Rating
tstark    | Intermediate
srogers   | Beginner
nromanoff | Advanced

While our original table can stay as it was - except of course with the attribute Player_Rating now removed.

Player_ID | Item_Type  | Item_Quantity
tstark    | lasers     | 2
tstark    | missiles   | 5
srogers   | powerballs | 15
nromanoff | shields    | 4
nromanoff | staff      | 6
nromanoff | powerballs | 10
nromanoff | missiles   | 8

And thus we can now say, for both tables, there are no “part-primary-key dependencies”. That is, it is always the case that every non-primary key attribute depends on the whole primary key, and not just part of it.
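The split might be sketched in SQL as follows (the table names Players and Player_Inventory, and the column sizes, are assumptions):

CREATE TABLE Players (
  Player_ID     VARCHAR(20) PRIMARY KEY,
  Player_Rating VARCHAR(15) NOT NULL
);

CREATE TABLE Player_Inventory (
  Player_ID     VARCHAR(20) NOT NULL,
  Item_Type     VARCHAR(30) NOT NULL,
  Item_Quantity INT         NOT NULL,
  PRIMARY KEY (Player_ID, Item_Type),
  -- each inventory row must refer to a known player
  FOREIGN KEY (Player_ID) REFERENCES Players (Player_ID)
);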

Rules: Second Normal Form

The following must be maintained to avoid violating 2NF and are thus required:

  • There can be no “part-key” dependencies.
  • Each non-primary-key attribute must depend on the entire primary key.

Third Normal Form (3NF)

Suppose, once again, we decide to enhance our tables from the previous example - this time by adding an attribute for Player_Skill_Level. And so we define that, with respect to Player_Rating, a Player_Skill_Level between 1 and 3 indicates Beginner, 4 to 6 indicates Intermediate, and 7 to 9 indicates Advanced.

However, since we now have Player_Rating and Player_Skill_Level in the same table a new problem arises. Consider: if we were to update a player’s skill level from 2 to 4, we should also update their rating from Beginner to Intermediate. However, suppose another network error affects our connection to the database such that the Player_Rating attribute is not updated. Once again we would have a data inconsistency.

The point here is that we went from a situation where Player_Skill_Level was functionally dependent on Player_ID to one where Player_Rating is also dependent on Player_ID - but only indirectly, since it is directly dependent on Player_Skill_Level. This is termed a “transitive dependency” between Player_Rating and Player_ID.

Hence the problem is located in the direct relationship between Player_Skill_Level and Player_Rating - because third normal form forbids this kind of transitive dependency. That is, the dependency of a non-primary-key attribute on another non-primary-key attribute.

A simple solution to move us into 3NF is to remove Player_Rating from the table altogether - and introduce a new table that uses Player_Skill_Level as the primary key and Player_Rating as the (sole) non-primary-key attribute.

Player_ID | Player_Skill_Level
tstark    | 4
srogers   | 2
nromanoff | 8

Player_Skill_Level | Player_Rating
4                  | Intermediate
2                  | Beginner
8                  | Advanced

Thus the Player_Skill_Level table tells us everything we need to know about how to translate the Player_Skill_Level into a Player_Rating. Every non-key attribute in a table should depend on the primary key, the whole primary key, and nothing but the primary key.
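A sketch of the lookup table, and of translating skill levels back into ratings (assuming the player table is named Players and now carries the Player_Skill_Level column; the name Skill_Level_Ratings is also an assumption):

CREATE TABLE Skill_Level_Ratings (
  Player_Skill_Level INT         PRIMARY KEY,
  Player_Rating      VARCHAR(15) NOT NULL
);

-- Translate each player's skill level into a rating:
SELECT p.Player_ID, r.Player_Rating
FROM Players p
INNER JOIN Skill_Level_Ratings r
  ON p.Player_Skill_Level = r.Player_Skill_Level;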

Rules: Third Normal Form

The following must be maintained to avoid violating 3NF and are thus required:

  • Every non-primary key attribute should depend on the primary key, the whole primary key, and nothing but the primary key.

Fourth Normal Form (4NF)

For the fourth normal form, imagine you have a database table listing car model, colour, and style: a customer chooses a car model, and then for that model chooses from a range of available colours and styles.

This table (e.g. Model_Colours_And_Styles_Available) would therefore need to contain all possible combinations.

And let’s further suppose the table is already normalized to 3NF, that the primary key consists of all three columns, and finally that every attribute depends on the whole primary key and nothing but the primary key.

Model   | Colour          | Style
Audi A3 | Arctic Blue     | Coupe
Audi A3 | Arctic Blue     | Cabriolet
Audi A3 | Flame Red       | Coupe
Audi A3 | Flame Red       | Cabriolet
Audi A4 | Soylent Green   | Sedan
Audi A4 | Soylent Green   | SUV
Audi A4 | Sunshine Yellow | Sedan
Audi A4 | Sunshine Yellow | SUV

Despite the above table being in 3NF, it is still vulnerable to problems. For example:

  • The Audi A4 is available in colours Soylent Green and Sunshine Yellow.
    • But suppose Audi were to introduce a third colour: Dolphin Grey.
    • This would mean we would have to add two more rows to the Model_Colours_And_Styles_Available table - one for Dolphin Grey Sedan and another for Dolphin Grey SUV.
    • But now, if we encounter a network error, say, and only add a single new row (Dolphin Grey Sedan) then of course we’ll have a data inconsistency.

Available colours are supposed to be independent of the available styles - but because of the network error (or whatever other error occurred during the update) our table is now saying a customer can choose a Dolphin Grey Audi A4 but only in the Sedan style. When of course we know that the SUV style should also be available in Dolphin Grey.

An impossible situation has thus arisen in Model_Colours_And_Styles_Available.

In this problem the crux lies in whether colour has a functional dependency on the model. The answer is no, because a specific model isn’t associated with just one colour. And yet it seems as if colour does have some relationship to model - it is just that we must express it differently. We can say that each model has a specific set of available colours - this kind of dependency is called a multi-valued dependency, and can be expressed as:

{ Model } ↠ { Colour }

… and it is equally true that each model has a specific set of available styles:

{ Model } ↠ { Style }

Thus, what fourth normal form states is that the only kinds of multi-valued dependencies we are allowed to have in a table are multi-valued dependencies on the primary key. Model on its own is not the primary key (the key is all three columns together) and thus table Model_Colours_And_Styles_Available is not in 4NF.

The solution is to split the table out into multiple tables.

Model   | Colour
Audi A3 | Arctic Blue
Audi A3 | Flame Red
Audi A4 | Soylent Green
Audi A4 | Sunshine Yellow

Model   | Style
Audi A3 | Coupe
Audi A3 | Cabriolet
Audi A4 | Sedan
Audi A4 | SUV

And thus now if we need to expand the range of colours available for a particular model we simply add a row to that specific table.
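A sketch of the two split tables (the table names Model_Colours and Model_Styles, and the column sizes, are assumptions):

CREATE TABLE Model_Colours (
  Model  VARCHAR(20) NOT NULL,
  Colour VARCHAR(20) NOT NULL,
  PRIMARY KEY (Model, Colour)
);

CREATE TABLE Model_Styles (
  Model VARCHAR(20) NOT NULL,
  Style VARCHAR(20) NOT NULL,
  PRIMARY KEY (Model, Style)
);

-- Introducing a new colour for a model is now a single-row change:
INSERT INTO Model_Colours VALUES ('Audi A4', 'Dolphin Grey');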

Rules: Fourth Normal Form

The following must be maintained to avoid violating 4NF and are thus required:

  • Multi-valued dependencies in a table must be multi-valued dependencies on the primary key.

Fifth Normal Form (5NF)

Imagine there are three different brands of ice cream available. Each of the brands offers a different range of flavours, as follows:

  • Frosty
    • Vanilla
    • Chocolate
    • Strawberry
    • Mint Chocolate Chip
  • Alpine
    • Vanilla
    • Rum Raisin
  • Ice Queen
    • Vanilla
    • Strawberry
    • Mint Chocolate Chip

A customer, Jason, states that he likes vanilla and chocolate flavours, but will only purchase from the brands Frosty and Alpine. Meanwhile another customer, Jane, states that she likes Rum Raisin, Mint Chocolate Chip, and Strawberry, but will only purchase from the brands Alpine and Ice Queen.

When expressed in a database table we have:

Person | Brand     | Flavour
Jason  | Frosty    | Vanilla
Jason  | Frosty    | Chocolate
Jason  | Alpine    | Vanilla
Jane   | Alpine    | Rum Raisin
Jane   | Ice Queen | Mint Chocolate Chip
Jane   | Ice Queen | Strawberry

But now, suppose some time in the future, Jane announces that she also likes the brand Frosty. We would therefore need to update our table:

Person | Brand     | Flavour
Jason  | Frosty    | Vanilla
Jason  | Frosty    | Chocolate
Jason  | Alpine    | Vanilla
Jane   | Alpine    | Rum Raisin
Jane   | Ice Queen | Mint Chocolate Chip
Jane   | Ice Queen | Strawberry
Jane   | Frosty    | Mint Chocolate Chip
Jane   | Frosty    | Strawberry

Once again, suppose some sort of network error meant we only added the single new row (Jane, brand Frosty, flavour Strawberry), while the other new row that should have been added (Jane, brand Frosty, flavour Mint Chocolate Chip) was not.

As a result we would have a logically inconsistent table - since we had already established that Jane liked the brand Frosty and liked the flavour Mint Chocolate Chip. Thus by the omission of this option from the updated table we introduced (i.e. allowed for) the logical inconsistency.

The problem arose because right at the beginning of the problem we were given three pieces of information:

  1. which brands offered which flavours.
  2. which people liked which brands.
  3. which people liked which flavours.

And what we should have done from the outset was simply to have created a table for each piece of information.

Brand     | Flavour
Frosty    | Vanilla
Frosty    | Chocolate
Frosty    | Strawberry
Frosty    | Mint Chocolate Chip
Alpine    | Vanilla
Alpine    | Rum Raisin
Ice Queen | Vanilla
Ice Queen | Mint Chocolate Chip
Ice Queen | Strawberry

Person | Brand
Jason  | Frosty
Jason  | Alpine
Jane   | Alpine
Jane   | Ice Queen

Person | Flavour
Jason  | Vanilla
Jason  | Chocolate
Jane   | Rum Raisin
Jane   | Mint Chocolate Chip
Jane   | Strawberry

Such a design solves the logical inconsistency discussed - and is in 5NF. Simply put, to ensure that a table (which is already in 4NF) is in 5NF we need to ask whether the table can be logically thought of as being the result of joining some other tables together. If it can be thought of in this way then it is not in 5NF.
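The three tables might be declared as follows - a sketch, where the table names match those used in the query further below, while the column sizes are assumptions:

CREATE TABLE Available_Flavours_By_Brand (
  Brand   VARCHAR(20) NOT NULL,
  Flavour VARCHAR(30) NOT NULL,
  PRIMARY KEY (Brand, Flavour)
);

CREATE TABLE Preferred_Brands_By_Person (
  Person VARCHAR(20) NOT NULL,
  Brand  VARCHAR(20) NOT NULL,
  PRIMARY KEY (Person, Brand)
);

CREATE TABLE Preferred_Flavours_By_Person (
  Person  VARCHAR(20) NOT NULL,
  Flavour VARCHAR(30) NOT NULL,
  PRIMARY KEY (Person, Flavour)
);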

Usage of the above three tables within a SQL query might look something like the following:

SELECT
  pbrand.Person,
  bf.Brand,
  bf.Flavour
FROM
  Preferred_Brands_By_Person pbrand
INNER JOIN
  Preferred_Flavours_By_Person pflavour
ON
  pbrand.Person = pflavour.Person
INNER JOIN
  Available_Flavours_By_Brand bf
ON
  pbrand.Brand = bf.Brand
AND
  pflavour.Flavour = bf.Flavour;

Rules: Fifth Normal Form

The following must be maintained to avoid violating 5NF and are thus required:

  • It must not be possible to describe a table, which is already in fourth normal form, as being the logical result of joining some other tables together.

Conclusion

In conclusion, it is sufficient to say that the number of levels of normalization you design your database tables to adhere to depends entirely on the problem at hand. It may be perfectly reasonable to expect your tables to meet only 1NF and 2NF.

As a rule of thumb, the more data you are storing, and/or the more complex the problem, then the wiser it is to ensure your design adheres to higher levels of normalization. It is the complexity of the problem that usually dictates how many relationships you need to capture, and how many places bad data can expose weakness in your design. By designing against a higher level of normalization from the outset you reduce the likelihood of data gremlins ruining your table design.