Introduction
Even a good database design cannot always protect against bad data. However, there are many cases of bad data where a well thought out database design can protect us against major problems - such as if the data were telling us something we know couldn’t possibly be logically true.
Examples of bad data can range from the simple to the complex (via the insidious!). Consider a customer table where a particular customer has two dates of birth. Such an example might be referred to as a “failure of data integrity”, and as such the data in the entire table can no longer be trusted because it disagrees with itself.
When data disagrees with itself, as in the example above, it reflects more than simply a problem of bad data. It indicates we have a problem in the very design of the database (table) itself.
And thus to the focus of this article; bad data often exposes bad database design where the database has not been normalized - or perhaps has not been normalized to a sufficient level.
Why Normalize?
When you normalize a database table you are, at a high level, structuring it in such a way that it cannot express redundant information. Furthermore you, in some sense, simplify what it is the table is telling you - thereby making it easier to read and use.
Normalized tables are thus:
- Protected from contradictory data.
- Easier to understand.
- Easier to enhance and extend.
More fundamentally, by sufficiently normalizing a database table you protect yourself (and your data) from three common anomalies that often appear:
- Insertion anomalies.
- Update anomalies.
- Deletion anomalies.
What is Normalization?
There are sets of criteria that we can use to assess our “exposure” to bad database design - “exposure” in the sense that bad data would be far more likely to invalidate the entire table. These criteria are referred to as the “normal forms”, such as the first normal form, second normal form, third normal form, and so on.
These normal forms can be thought of in analogy to safety assessments one might conduct to any other engineering endeavour. For example in bridge design, if the bridge were to pass (only) a level 1 safety assessment the engineer would know for sure that is was safe for pedestrian traffic. But anything heavier, such as a car, would pose a risk to the structural integrity of the bridge. So if the bridge was to then undergo, and pass, a level 2 safety assessment the engineer would be confident that small car traffic could use the roadway. And again anything heavier, like a lorry or large van, would reintroduce a safety risk. And so in turn you can imagine applying consecutively higher levels of safety assessment that progressively allowed for heavier or more frequent traffic.
The normal forms of database theory work in a similar way. If a database table meets the requirements of first normal form (1NF) we know it meets the bare minimum data integrity safety gaurantee. But if we knew it also met the second normal form (2NF) then we’d know it also met those more strict safety gaurantees. And so it would continue through the higher normal forms.
First Normal Form (1NF)
Suppose you were to ask “what are the 7 colours of a rainbow?”. One person might answer yellow, blue, red, orange, green, violet, and indigo. Meanwhile someone else might answer green, indigo, blue, orange, red, yellow, and violet. To those asking and answering the question, both answers are correct … and both answers are equivalent. After all, it is only the order in which the answers are given that differs. Likewise, with respect to a relational database, the same principle applies; we get the answer back to us in an arbitrary order.
SELECT Colour_Name
FROM Colours_Of_Rainbow;
However, now suppose we were to ask: what is the mean wavelength (in nanometers) for each colour of a rainbow - from longest to shortest? Clearly here it isn’t simply the colour names that convey meaning - it is also the order in which they are returned to us that has a crucial meaning as well. And so although the first example, with just the colour names, was perfectly understandable to a human - is was, however, not normalized as far as a relational database is concerned. Since there is no concept of row order within a database table (other than alphabetically) - the table, regardless of the SQL used, is unable to provide the correct answer.
First 1NF Violation
And so we have our first violation of 1NF - when we use row order to convey information in our database table we are violating first normal form.
For the above examples the solution is simply that we should be explicit in our design. Rather than just listing out the colour names - i.e. having a table with only one column for colour names - instead we should also devote a second column to the colour’s mean wavelength. And the units of wavelength, nanometer, provide a natural ordering we can use in subsequent queries.
Second 1NF Violation
The second way of violating 1NF occurs if we mix data types. For example, if in the second column from the example above, we expressed some mean wavelengths as floating point numbers while expressing others as strings. Although a spreadsheet system (like Microsoft Excel) might allow this, in a relational database you would not be allowed such a freedom. This kind of ambiguity in the data types being stored would not be allowed.
And so to our second violation of 1NF - mixing data types within the same column violates first normal form.
Third 1NF Violation
The third way of violating 1NF involves designing a database table without using a primary key. A primary key is simply a column (or combination of columns) that uniquely identifies a particular row in a table. For example in the rainbow colours question above we would need to know, unequivocally, that each row tells us the name of a specific colour of the rainbow (and it’s mean wavelength) and nothing else. Thus it would make sense in this example to designate the name of the colour as the primary key. The second column containing the mean wavelength being left untouched.
ALTER TABLE Colours_Of_Rainbow
ADD PRIMARY KEY (Colour_Name);
With this primary key in place the RDMS would prevent us from having rows where the colour name was a duplicate of some other row. And this would be to our advantage as it would prevent us from having two different mean wavelengths for the same colour.
Hence, every table should have a primary key - if it doesn’t, then it violates first normal form.
Fourth 1NF Violation
The final way of violating 1NF involves repeating groups. For example, the third row in the Inventory column of the following example table is a repeating group:
Player_ID | Inventory |
---|---|
tstark | 2 lasers , 5 missiles |
srogers | 15 powerballs |
nromanoff | 4 shields, 6 staff, 10 powerballs, 8 missiles |
In the above example table, it isn’t hard to imagine there could be many other types of items that a player could have and likewise many other different combinations of these items that may be present in the player’s inventory.
So now, you could design a table that defined the inventory as a string of text - but clearly this would become overwhelming very quickly and prohibitively cumbersome to query with even the simplest of queries.
Alternatively, you could have individual columns for each type of inventory item (one for lasers, one for missiles, etc., etc.) - and indeed individual columns for the tally count of each of those items. However, again, given that (for example) there could many 100’s of possible inventory items - or worse still, there may even be no limit to the number of possible inventory items - then this too would not be a particularly practical solution either.
And so, the presence of these “repeating groups” of data items, on a single row, would violate first normal form.
A better solution would be to have individual rows for each item type that a player has in their inventory.
Player_ID | Item_Type | Item_Quantity |
---|---|---|
tstark | lasers | 2 |
tstark | missiles | 5 |
srogers | powerballs | 15 |
nromanoff | shields | 4 |
nromanoff | staff | 6 |
nromanoff | powerballs | 10 |
nromanoff | missiles | 8 |
Here we allow for the many 100’s of possible inventory items (and combinations of item) by extending the number of rows in the database table. The advantage being that our queries of the table will be the same, regardless of how many rows there are. The complexity of any query arises by virtue of it’s columnar structure (the table design) such that the number of rows of data becomes almost irrelevant. And you can imagine here we would designate the combination of the columns Player_ID
and Item_Type
as the primary key.
Rules: First Normal Form
The following would violate 1NF and are thus not permitted:
- Using row order to convey information.
- Mixing data types within the same column.
- Having a table without a primary key.
- Allowing repeating groups.
Second Normal Form (2NF)
Suppose we wished to enhance the previous example by including a player rating.
Player_ID | Item_Type | Item_Quantity | Player_Rating |
---|---|---|---|
tstark | lasers | 2 | Intermediate |
tstark | missiles | 5 | Intermediate |
srogers | powerballs | 15 | Beginner |
nromanoff | shields | 4 | Advanced |
nromanoff | staff | 6 | Advanced |
nromanoff | powerballs | 10 | Advanced |
nromanoff | missiles | 8 | Advanced |
You’ll notice that we now have repeated information, given that as a solution to the repeated groups problem we introduced a row for each inventory item - now meaning that the Player_ID
and Player_Rating
information is necessarily repeated over each row.
Here then we note that it is not good design practice to repeat information in such a way. To demonstrate, suppose Player_ID
srogers
were to lose the 15
powerball
items in their inventory. We would have to remove that row from the table. But this would also require us to remove the player rating. Thus we wouldn’t be able to query the player rating at all. It would be as if the player srogers
had entirely disappeared.
In fact this is known as a “deletion anomaly” - where the database would no longer know anything about a particular entity because it was deleted when some associated data was deleted.
Alternatively, suppose player tstark
changed their rating from Intermediate
to Advanced
. To now capture this new rating in the above table we would have to update two records (once for the row with lasers, and again for the row with missiles). But suppose that during this process, some external network error affected our connection to the database such that as a consequence only one of those records was updated. We would now have contradictory information in the table - one row saying tstark
had a Player_Rating
of Intermediate
and the other row saying it was Advanced
. Such a logical inconsistency is known as an “update anomaly”.
Finally, suppose a new player was added. Since this player is inevitably going to start with no items at all in their inventory, then there would be nothing for us to place in either the Item_Type
and Item_Quantity
columns. If the notion of a Null
field wasn’t recognised then a record (row) for this new player wouldn’t exist and thus there would also be no record of their rating. It would be as if this new player didn’t actually exist - a so-called “insertion anomaly”.
These deletion, update, and insertion anomalies determine that the above example table has violated second normal form. Thus, 2NF is about how a table’s non-primary key columns (the non-key attributes) relate to the primary key. In the table above, these non-key attributes are the Item_Quantity
and Player_Rating
columns and we can appreciate how neither of these attributes relate in any way to the primary key (Player_ID
and Item_Type
).
- Attribute
Item_Quantity
does depend on the entire primary key. - There is a functional dependency between the primary key and the attribute (
Item_Quantity
) - each value of the primary key is associated with exactly one value ofItem_Quantity
.
However, the same is not true for Player_Rating
. This is a property of the player only i.e. Player_ID
. This dependency is the problem - because Player_ID
alone is not the primary key.
So, the issue is that we added and attribute (Player_Rating
) to a table where it didn’t really belong. The fact that Player_Rating
is a property of Player_ID
(alone) tells us that a player is an important concept in it’s own right - and thus deserves it’s own table.
Player_ID | Player_Rating |
---|---|
tstark | Intermediate |
srogers | Beginner |
nromanoff | Advanced |
While our original table can stay as it was - except of course with the attribute Player_Rating
now removed.
Player_ID | Item_Type | Item_Quantity |
---|---|---|
tstark | lasers | 2 |
tstark | missiles | 5 |
srogers | powerballs | 15 |
nromanoff | shields | 4 |
nromanoff | staff | 6 |
nromanoff | powerballs | 10 |
nromanoff | missiles | 8 |
And thus we can now say, for both tables, there are no “part-primary-key dependencies”. That is, it is always the case that every non-primary key attribute depends on the whole primary key, and not just part of it.
Rules: Second Normal Form
The following must be maintained to avoid violating 2NF and are thus required:
- There can be no “part-key” dependencies.
- Each non-primary key must depend on the entire primary key.
Third Normal Form (3NF)
Suppose, once again we decide to enhance our tables from the previous example. This time by adding an attribute for Player_Skill_Level
. And so we define that, with respect to Player_Rating
, a Player_Skill_Level
between 1 and 3 indicates a Beginner
, 4 to 6 indicates Intermediate
, and 7 to 9 means Advanced
.
However, since we now have Player_Rating
and Player_Skill_Level
in the same table a new problem arises. Consider, if we were to update a player rating from skill level 2 to 4, we should also update their rating from Beginner
to Intermediate
. However, suppose another network error effects our connection to the database such that attribute Player_Rating
is not updated. Once again we would have a data inconsistency.
The point here is that we went from a situation where Player_Skill_Level
was functionally dependent on Player_ID
to one where now Player_Rating
is also dependent on Player_ID
but only indirectly since it is first/directly dependent on Player_Skill_Level
. This is termed as a “transitive dependency” between Player_Rating
and Player_ID
.
Hence the problem is located in the direct relationship between Player_Skill_Level
and Player_Rating
- because third normal form forbids this kind of transitive dependency. That is, the dependency of a non-primary-key attribute on another non-primary-key attribute.
A simple solution to move us into 3NF is to remove Player_Rating
from the table altogether - and thus introduce a new table that uses Player_Skill_Level
as the primary key and the Player_Rating
and the non-primary key attribute.
Player_ID | Player_Skill_Level |
---|---|
tstark | 4 |
srogers | 2 |
nromanoff | 8 |
Player_Skill_Level | Player_Rating |
---|---|
4 | Intermediate |
2 | Beginner |
8 | Advanced |
Thus the Player_Skill_Level
table tells us everything we need to know about how to translate the Player_Skill_Level
into a Player_Rating
. Every non-key attribute in a table should depend on the primary key, the whole primary key, and nothing but the primary key.
Rules: Third Normal Form
The following must be maintained to avoid violating 3NF and are thus required:
- Every non-primary key attribute should depend on the primary key, the whole primary key, and nothing but the primary key.
Fourth Normal Form (4NF)
For the fourth normal form, imagine you have a database table listing car model, colour, and style. So you can choose from combinations of car model, and then for that model choose from a range of available colours and styles.
This table (e.g. Model_Colours_And_Styles_Available
) would therefore need to contain all possible combinations.
And let’s further suppose the table is already normalized to 3NF, that the primary key consists of all three columns, and finally that every attribute depends on the whole primary key and nothing but the primary key.
Model | Colour | Style |
---|---|---|
Audi A3 | Artic Blue | Coupe |
Audi A3 | Artic Blue | Cabriolet |
Audi A3 | Flame Red | Coupe |
Audi A3 | Flame Red | Cabriolet |
Audi A4 | Soylent Green | Sedan |
Audi A4 | Soylent Green | SUV |
Audi A4 | Sunshine Yellow | Sedan |
Audi A4 | Sunshine Yellow | SUV |
Despite the above table being in 3NF, it is still vulnerable to problems. For example:
- The Audi A4 is available in colours Soylent Green and Sunshine Yellow.
- But suppose Audi were to introduce a third colour: Dolphin Grey.
- This would mean we would have to add two more rows to the
Model_Colours_And_Styles_Available
table - one for Dolphin Grey Sedan and another for Dolphin Grey SUV. - But now, if we encounter e.g. a network error and only add a single new row of e.g. Dolphin Grey Sedan then of course we’ll have a data inconsistency.
Available colours are supposed to be independent of the available styles - but because of the network error (or whatever other error occurred during the update) our table is now saying a customer can choose a Dolphin Grey Audi A4 but only in the Sedan style. When of course we know that the SUV style should also be available in Dolphin Grey.
An impossible situation has thus arisen in Model_Colours_And_Styles_Available
.
In this problem the crux lies in whether colour has a functional dependency on the model. The answer is no, because a specific model isn’t associated with just one colour. And yet it seems as if colour does have some relationship to model - it is just that we must express it differently. We can say that each model has a specific set of available colours - this kind of dependency is called a multi-valued dependency, and can be expressed as:
{ Model
} ↠ { Colour
}
… and it is equally true that each model has a specific set of available styles:
{ Model
} ↠ { Style
}
Thus, what fourth normal form states is that the only kinds of multi-valued dependencies we are allowed to have in a table are multi-valued dependencies on the primary key. Model
is not in the primary key and thus table Model_Colours_And_Styles_Available
is not in 4NF.
The solution is to split the table out into multiple tables.
Model | Colour |
---|---|
Audi A3 | Artic Blue |
Audi A3 | Flame Red |
Audi A4 | Soylent Green |
Audi A4 | Sunshine Yellow |
Model | Style |
---|---|
Audi A3 | Coupe |
Audi A3 | Cabriolet |
Audi A4 | Sedan |
Audi A4 | SUV |
And thus now if we need to expand the range of colours available for a particular model we simply add a row to that specific table.
Rules: Fourth Normal Form
The following must be maintained to avoid violating 4NF and are thus required:
- Multi-valued dependencies in a table must be multivalued dependencies on the primary key.
Fifth Normal Form (5NF)
Imagine there are three different brands of ice cream available. Each of the brands’ offers a different range of flavours, as follows:
- Frosty
- Vanilla
- Chocolate
- Stawberry
- Mint Chocolate Chip
- Alpine
- Vanilla
- Rum Raisin
- Ice Queen
- Vanilla
- Stawberry
- Mint Chocolate Chip
A customer, Jason, states that he likes vanilla and chocolate flavours, but will only purchase from the brands Frosty and Alpine. Meanwhile another customer, Jane, states that she likes Rum Raisin and Mint Chocolate Chip and Strawberry, but will only purchase from the brands Alpine and Ice Queen.
When expressed in a database table we have:
Person | Brand | Flavour |
---|---|---|
Jason | Frosty | Vanilla |
Jason | Frosty | Chocolate |
Jason | Alpine | Vanilla |
Jane | Alpine | Rum Raisin |
Jane | Ice Queen | Mint Chocolate Chip |
Jane | Ice Queen | Strawberry |
But now, suppose some time in the future, Jane announces that she also likes the brand Frosty. We would therefore need to update our table:
Person | Brand | Flavour |
---|---|---|
Jason | Frosty | Vanilla |
Jason | Frosty | Chocolate |
Jason | Alpine | Vanilla |
Jane | Alpine | Rum Raisin |
Jane | Ice Queen | Mint Chocolate Chip |
Jane | Ice Queen | Strawberry |
Jane | Frosty | Mint Chocolate Chip |
Jane | Frosty | Strawberry |
Once again, suppose there was some sort of network error such that we only added the single new row of Jane, brand Frosty, and flavour Strawberry, while the other new options were not updated (again, due to that network error). As such, the other new row that should have been added (Jane, brand Frosty, and flavour Mint Chocolate Chip) was not updated.
As a result we would have a logically inconsistent table - since we had already established that Jane liked the brand Frosty and liked the flavour Mint Chocolate Chip. Thus by the omission of this option from the updated table we introduced (i.e. allowed for) the logical inconsistency.
The problem arose because right at the beginning of the problem we were given three pieces of information:
- which brands offered which flavours.
- which people liked which brands.
- which people liked which flavours.
And what we should have done from the outset was simply to have created a table for each piece of information.
Brand | Flavour |
---|---|
Frosty | Vanilla |
Frosty | Chocolate |
Frosty | Strawberry |
Frosty | Mint Chocolate Chip |
Alpine | Vanilla |
Alpine | Rum Raisin |
Ice Queen | Vanilla |
Ice Queen | Mint Chocolate Chip |
Ice Queen | Strawberry |
Person | Brand |
---|---|
Jason | Frosty |
Jason | Alpine |
Jane | Alpine |
Jane | Ice Queen |
Person | Flavour |
---|---|
Jason | Vanilla |
Jason | Chocolate |
Jane | Rum Raisin |
Jane | Mint Chocolate Chip |
Jane | Strawberry |
Such a design would both solve the logical inconsistency discussed - and is in 5NF. Simply put, to ensure that a table (which is already in 4NF) is in 5NF we need to ask whether the table can be logically thought of as being the result of joining some other tables together. If it can be thought of in this way then it is not in 5NF.
Usage of the above three tables within a SQL query might look something like the following:
SELECT
pbrand.Person,
bf.Brand,
bf.Flavour
FROM
Preferred_Brands_By_Person pbrand
INNER JOIN
Preferred_Flavours_By_Person pflavour
ON
pbrand.Person = pflavour.Person
INNER JOIN
Available_Flavours_By_Brand bf
ON
pbrand.Brand = bf.Brand
AND
pflavour.Flavour = bf.Flavour
Rules: Fifth Normal Form
The following must be maintained to avoid violating 5NF and are thus required:
- It must not be possible to describe a table, which is already in fourth normal form, as being the logical result of joining some other tables together.
Conclusion
As a conclusion it is sufficient to simply say that the number of levels of normalization you design your database tables to adhere to depends entirely on the problem at hand. It may be perfectly reasonable to expect your tables to meet only 1NF and 2NF.
As a rule of thumb, the more data you are storing, and/or the more complex the problem, then the wiser it is to ensure your design adheres to higher levels of normalization. It is the complexity of the problem that usually dictates how many relationships you need to capture, and how many places bad data can expose weakness in your design. By designing against a higher level of normalization from the outset you reduce the likelihood of data gremlins ruining your table design.