5NF 和数据库设计
5NF and Database Design

原始链接: https://kb.databasedesignbook.com/posts/5nf/

## 解构第五范式 (5NF) 本文挑战了关系数据库中关于5NF的传统教学,认为目前的解释过于复杂。它提倡一种设计流程,先明确业务需求和逻辑模型,*再*考虑规范化。 作者批评了常见的例子——包括维基百科的“旅行商”场景——是人为构建且无益的。相反,它提出了两种实际模式:“AB-BC-AC三角形”(以冰淇淋偏好为例)和“ABC+D星形”(使用音乐家、音乐会和乐器)。这些模式自然地从明确定义的业务逻辑中产生。 至关重要的是,文章强调5NF分解不是必要的步骤。使用复合主键还是合成主键取决于业务规则所规定的唯一性约束,而不是为了追求5NF本身。 作者展示了如何通过添加额外的需求(例如特定的口味偏好或音乐家的技能)来扩展这些例子,这只是增加了逻辑模型,而不是需要复杂的重构。最终目标是实现一个规范化的模式,消除冗余和异常,通过逻辑、需求驱动的设计过程来实现——使得显式考虑5NF往往是不必要的。

黑客新闻 新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 5NF 和数据库设计 (databasedesignbook.com) 24 分,来自 petalmind 37 分钟前 | 隐藏 | 过去 | 收藏 | 2 条评论 帮助 tadfisher 15 分钟前 [–] 我喜欢阅读关于范式的文章,因为这让我听起来好像我知道我在说什么,尤其是在后端人员告诉我“如果我们规范化这些数据,数据库就会崩溃”的对话中。这通常会接着讨论 UUID 版本,出于某种原因。回复 necovek 2 分钟前 | 父评论 [–] 所以他们争论哪个范式是好是坏?哪个 UUID 版本赢得了争论? 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系 搜索:
相关文章

原文

Author: Alexey Makhotkin [email protected], (~3900 words)

One of the goals of this publication is to deconstruct the traditional ways of teaching basic topics in relational databases. Previously we discussed the fourth normal form (4NF): “Historically, 4NF explanations are needlessly confusing”.

Let’s discuss the ultimate beast: fifth normal form (5NF). Often it’s presented even more confusingly than 4NF is, and we can show that this presentation is unnecessary, and the confusion is completely artificial.

Here is the roadmap of this post:

  • It’s essential to have good, well-motivated examples. We start with a survey of examples used in various texts that present 5NF.
  • Wikipedia should have a good baseline explanation of a concept, but in the case of 5NF, it does not. We discuss the problems with the Wikipedia example.
  • We discuss a more logical sequence of designing tables: a) start with business requirements, b) create a logical model, c) design physical table schema.
  • We discuss the “ice cream” example, and the AB-BC-AC triangle pattern that naturally appears here.
  • We discuss the “musicians” example, and the ABC+D star pattern that appears here. For some business requirements, there is also a choice between using a composite or a synthetic primary key.
  • Finally, we come to the conclusion that you don’t really need to involve 5NF to design your table schema. You begin with the logical model, and apply a textbook table design strategy that preserves normalization.
  • In two extra chapters we discuss extending both examples, with two different teaching points.

Table of contents

Subscribe here to receive updates.


Sources

First, we need to list the teaching scenarios that are used in the commonly available sources.

Wikipedia: Salesman / Brand / Product Type, with the following relationships:

  • Salesman sells product types (e.g. “vacuum cleaner”);
  • Salesman sells brands (e.g. “Philips”);
  • Brands offer product types;

https://en.wikipedia.org/w/index.php?title=Fifth_normal_form&oldid=1344596188#Example

Decomplexify: Ice cream brands / Flavours / Friends, with the following relationships:

  • Brands offer flavours (e.g., Frosty’s offers Vanilla, Chocolate, Strawberry, and Mint);
  • Friends like flavours (e.g. Jason likes vanilla and chocolate);
  • Friends like brands (e.g., Jason likes Frosty’s and Alpine);

https://www.youtube.com/watch?v=GFQaEYEc8_8&t=1427s (see also the “corrections” in the pinned comment).

Data Demythed blog by Barry Johnson: Concerts / Musicians / Instruments / Performances.

  • Musician participates in a Concert;
  • Musician is able to play an Instrument;
  • Musician plays an Instrument at a certain Concert;

https://datademythed.com/posts/5nf_missing_use_case/
(I particularly recommend the “In Conclusion” section)

Barry’s text also discusses the traditional poor treatment of 5NF. Back then both of us had a discussion in the comments, which helped me to make some progress in understanding. Here I’ll discuss my treatment of this business case, because it seems to be structurally different from the previous two examples.

Making sense of Wikipedia example

Wikipedia presents its example in a roundabout way. First it shows a three-column table (“Traveling salesman”, “Brand”, “Product type”).

Wikipedia example: traveling-salesman product availability by brand

What does this table mean? Let’s discuss later.

Then there is the following quote:

“In the absence of any rules restricting the valid possible combinations of traveling salespeople, brand, and product type, the three-attribute table above is necessary in order to model the situation correctly.

Suppose, however, that the following rule applies: A traveling salesperson has certain brands and certain product types in their repertoire. If brand B1 and brand B2 are in their repertoire, and product type P is in their repertoire, then (assuming brand B1 and brand B2 both make product type P), the traveling salesperson must offer product type P from both brands; that is, the salesperson cannot sell only B1’s product P or only B2’s product P.

In that case, it is possible to split the table into three:

Wikipedia example: Product types by traveling salesman / Brands by traveling salesman / Product types by brand

In this case, it’s impossible for Louis Ferguson to refuse to offer vacuum cleaners made by Acme (assuming Acme makes vacuum cleaners) if he sells anything else made by Acme (lava lamp) and he also sells vacuum cleaners made by any other brand (Robusto).”

The “impossible to refuse” wording, and the entire premise of “must offer product type P from both brands” describe a weird scenario. I am not sure if it ever happens in real life.

Setting the tables aside: the rule itself is weird, it means that we cannot handle the normal case. If I want to start selling Robusto vacuum cleaners then I must also start selling Acme vacuum cleaners. If I want to sell Acme breadboxes, I must also sell Acme vacuum cleaners.

This makes no business sense, and because of that you won’t understand 5NF from this example. Nowadays I probably understand what this contrived wording is trying to achieve pedagogically, but I think that it’s failing at that. It’s not useful for understanding database design.

Begin with logical model

The entire premise is illogical. A table is presented, and we ask: what could be the meaning of this table? Then we split this table into three tables, and we ask again: what could be the meaning of those tables? We ask: what if there was this extra rule that makes very little sense, if we’d take it into account would we interpret the tables differently?

Instead, we must begin with the logical model that corresponds to the actual business scenario. When the logical model is complete, we can build a physical schema, using a textbook table design strategy. The result would be perfectly normalized: no redundancy, and no anomalies.

Two logical design patterns arise around 5NF:

  • AB-BC-AC triangle (see the “ice cream” example below), and
  • ABC+D star pattern (see the “musicians” example).

AB-BC-AC triangle: the “ice cream” example

Here is my rephrasing of the “ice cream” example, presented in the Decomplexify video on database normalization:

There are several ice cream brands on the market, e.g. “Frosty’s”, “Alpine”, “Ice Queen”, etc. Each brand produces one or more flavours of ice cream, e.g. vanilla, strawberry, and rum raisin. We want to create a database of our friends’ ice cream preferences. Each friend likes some brands, and also some flavours. We assume that those preferences intersect: that is, if a friend likes brands A and B and flavours 1 and 2 that means that they like exactly A-1, A-2, B-1, and B-2 (restricted to pairs the brand actually produces). They won’t like A-3 or C-1.

Two things to note: first, this example is more plausible than Wikipedia’s one, we’ll take it. Second, there is no requirement to handle specific tastes. For example, we don’t record the fact that the aforementioned friend likes D-4 additionally, nor the fact that they don’t like B-1.

Let’s write down the logical model that corresponds to the description above, using notation from “Database Design Book”. We’ll have three anchors (entities):

Anchor ID example Table name
Brand “Alpine” brands
Flavour “vanilla” flavours
Friend “Jason” friends

In the real database we’d use proper integer IDs, but this one is a tiny teaching database, so it makes sense to use so-called natural keys which are unique strings. We assume that all your friends have different names. In the textbook table design strategy, each anchor has a corresponding table. The table names are written in the last column.

We don’t need any attributes here, because all information about entities is basically contained in the ID.

But the most interesting part is the list of links. 5NF is mostly about the links.

Anchor1 : Anchor2 Cardi-
nality
Sentences Table or column names
Brand : Flavour M:N A Brand produces several Flavours

A Flavour is produced by several Brands

brand_flavours
Friend : Brand M:N A Friend prefers several Brands

A Brand is preferred by several Friends

friend_brands
Friend : Flavour M:N A Friend prefers several Flavours

A Flavour is preferred by several Friends

friend_flavours

A link is a relationship between two anchors. Here we have three links, connecting three anchors into a triangle, as we mentioned:

Brand / Flavour / Friend anchors

All three links have M:N cardinality. It is clearly visible because all the sentences contain the word “several”. In the textbook table design strategy, each M:N link has a corresponding two-column table (also known as junction table). The names of the tables are written down in the last column.

Here are three link tables, created based on the logical model. It uses the same dataset as presented in the Decomplexify video.

Dataset: brand_flavours, friend_brands, friend_flavours

One thing that we can see here is that Jason likes chocolate, but none of the brands produce it.

A useful exercise for the interested reader would be to write an SQL query that returns the brand offerings that would suit each friend, based on their preferences.

Note that there would exists three more tables: brands, flavours, and friends, containing IDs used in the link tables.

Dataset: brands, flavours, friends

Note that for each entity there is one more row that is not present in any of the link relationships. We know about the Coldflash brand, but we don’t know which flavours it provides, and nobody prefers it. We have a Kiwi flavour which is not produced by any brand and not preferred by anyone. And we have Suzy, whose preferences we have not yet registered.

ABC+D star pattern: the “musicians” example

Now let’s discuss an example from the “5NF: The Missing Use Case” post by Barry Johnson.

Suppose that we want to record information about concerts and musicians that played certain instruments in those concerts. We begin with the three anchors:

Anchor ID example Table name
Concert “christmas-2025” concerts
Instrument “violin” instruments
Musician “Patricia” musicians

Now let’s draft some possible links. This is one of the database modeling examples where natural language may mislead you, so we need to be careful and precise.

In plain English we’d say: “A Musician played an Instrument in a Concert”, but this sentence includes three anchors and not two. Links always connect two anchors. Even though the relational model allows relations with more than two elements, we insist on only using 2-element links because they help you design unambiguous tables.

Here is an example from a real-world concert, listing musicians and their instruments:

  • Marc, violin;
  • Gilles, violin;
  • Vlad, viola;
  • Yovan, cello.

You see that several musicians can play an instrument. If you think about it, it’s possible that one musician could play several instruments.

While we’re at it, why are we even building this database? Suppose that we want to track how much money musicians should get paid. Violin players get paid $x; maybe somebody was asked to help out and play some cymbals in the first half and a gong at the end, so they’d be paid $y + $z.

When you see a list of something, very often it would be an anchor. List items could be counted, and you can add one more item to the list: a classic anchor.

We need to find a word for this — let’s try “Performance”:

Anchor ID example Table name
Performance <two options possible, see below> performances

Finding words like this is often hard, because natural language is ambiguous.

(Note that we use a synthetic integer ID for the performances.)

Let’s try to find links between the four anchors now. A concert is made of individual performances. A performance involves a specific musician playing a specific instrument. Let’s formalize this:

Anchor1 : Anchor2 Cardi-
nality
Sentences Table or column names
Concert : Performance 1:N A Concert consists of several Performances

A Performance is a part of only one Concert

performances.
concert_id
Instrument : Performance 1:N An Instrument is played in several Performances

A Performance involves only one Instrument

performances.
instrument_id
Musician : Performance 1:N A Musician may perform several Performances

A Performance is performed by only one Musician

performances.
musician_id

Read the sentences: they may sound a bit awkward, but you need to confirm that they make sense. It’s crucial to make sure that “only one” and “several” is used correctly, because that’s how you define cardinality. Read aloud if needed, or read to somebody else — that’s how you prevent design mistakes.

In this case, all three links have cardinality 1:N.

Here is a diagram that has a distinctive three-pointed star shape:

Concert, Musician, Instrument, Performance: anchors and links

This is why we call it the ABC+D star pattern. We have three “easy” words: concert, musician, and instrument. But you must also realize that there is an underlying concept of performance. Using a formalized approach, such as the one presented in the “Database Design Book”, helps a lot, making the design process more reliable and error-proof.

Physical table design: primary keys

The concept of normal forms only appears when we talk about physical table design. The ice cream example was simple and straightforward, but the current scenario is a bit more complicated, because we have to talk about uniqueness constraints.

Textbook table design strategy could be applied directly:

  • Performance anchor is implemented as “performances” table;
  • All three links are 1:N, so they are implemented as columns in that table (“concert_id”, “musician_id”, and “instrument_id”).

But here is the question: what would be the primary key of that table? In most anchor tables we’d use a synthetic ID column, but here we can notice that the combination of (concert_id, musician_id, instrument_id) must be unique. In awkward semi-formalized English: a musician is paid for performing an instrument only once for a given concert (this follows from business requirements).

If we use a synthetic ID column, we need to add a uniqueness constraint:

CREATE TABLE performances (  
  id INTEGER NOT NULL PRIMARY KEY,  
  concert_id INTEGER NOT NULL,  
  musician_id INTEGER NOT NULL,  
  instrument_id INTEGER NOT NULL,  
  UNIQUE (concert_id, musician_id, instrument_id)  
);

The data looks like this:
performances table, using synthetic ID primary key

But what the relational theory textbooks want you to know is that you can also design this table by using a composite primary key (which also serves as a uniqueness constraint):

CREATE TABLE performances (  
  concert_id INTEGER NOT NULL,  
  musician_id INTEGER NOT NULL,  
  instrument_id INTEGER NOT NULL,  
  PRIMARY KEY (concert_id, musician_id, instrument_id)  
);

And the data will look like this:
performances table, using composite primary key

The choice of composite vs synthetic primary key only appears when business requirements assume uniqueness. In other scenarios there is no uniqueness, and you must use the synthetic primary key.

Imagine an online game where users can buy certain items and give them as gifts to other users. The same user can give the same type of item to the same friend multiple times, so there is no uniqueness. We’d still have the same ABC+D star pattern, though.

Let’s go back to the three-column table variant. It’s the same structure as was used in Wikipedia and in many other textbooks and tutorials. We did not need to split it into three two-column tables because the business requirements are different from the “ice cream” example.

More importantly, we did not need to reason in terms of 5NF decomposition at all. We only needed to define business requirements, and construct the physical tables in a straightforward way.

Yet, we’ve seen both types of tables that are discussed in a typical 5NF tutorial:

  • three two-column tables (AB-BC-AC triangle);
  • one table (ABC+D star):
    • three-column if uniqueness is required and we want a composite PK;
    • one extra ID if there is no uniqueness or we want a synthetic PK;

Also, we did not need to perform this illogical operation of “splitting tables”, because we just designed all the tables correctly from the beginning. In practice they would not be split or joined as part of the database design process.

Both presented patterns are not mutually exclusive. You can have several different links that connect the same anchors in a different way.

Let’s go back to the ice cream example and extend it. Some of our friends are extra picky: they prefer specific flavours from a specific brand, for example kiwi by Coldflash, but no other flavours by Coldflash, and no other kiwi brands.

We need to handle this example by extending the logical model defined above. It’s important to note that the original behavior stays as it is: all the logical schema and existing tables stay as they are. We only add new entries to the logical schema, and we’ll add some new tables.

Having discussed the “musicians” example, we can immediately recognize how to model “Frank loves Coldflash kiwi”. We introduce a new anchor called Preference, and three links arranged as an ABC+D star pattern.

Logical schema is basically a copy-paste of the “musicians” example. Anchors:

Anchor ID example Table name
Preference <two options are possible> preferences

And links:

Anchor1 : Anchor2 Cardi-
nality
Sentences Table or column names
Friend : Preference 1:N A Friend can have several Preferences

A Preference belongs to only one Friend

preferences.
friend_id
Brand : Preference 1:N A Brand is involved in several Preferences

A Preference refers to only one Brand

preferences.
brand_id
Flavour : Preference 1:N A Flavour is involved in several Preferences

A Preference refers to only one Flavour

preferences.
flavour_id

Same as with “musicians” example, we have the option of synthetic ID or a composite PK. In this example each (Friend, Brand, Flavour) combination must be unique, so we can use a composite PK if we want.

From those new elements we create a new “preferences” table, that exists alongside the three other tables: “brand_flavours”, “friend_brands”, and “friend_flavours”. We still capture both the information about broader preferences and about specific preferences, using two sets of tables, designed accordingly.

The point of this section is to reinforce the idea that both patterns are equally useful, one is not a replacement for the other. Sometimes they could even be used simultaneously. Which of them is applicable depends only on your business requirements.

Let’s go back to the “musicians” example. If you read the “5NF: The Missing Use Case” post, you’ll see that at some point an idea of “skill” is introduced. Basically it’s a link that connects a Musician and an Instrument. This idea makes sense in natural language: “Alice can play violin, so she gets invited to play in the classical concerts”.

The data is modeled via this Skill in the text like this:

Concert / Musician / Instrument / Performance, including Skill

This differs from the model discussed in this post. Why? I’ve been thinking about this exact issue for quite some time, and came to look at the problem from a different angle.

I think that “A Musician can play an Instrument” is a valid link, but it corresponds to a different business process. We discussed capturing the information about who played which instrument in specific concerts.

But if you think about that, it’s possible to play an instrument in the concert without necessarily declaring that you have a skill to play it.

Consider the quote from that post:

“Looking at our example a bit further, for instance, it’s easy to conceive of a SkillRating in [Skill]. This could be used to help decide which [Musician] instance(s), with corresponding [Instrument] instance(s), will be invited to be part of a [Concert] instance.”

It makes total sense to extend our original logical model by capturing the information about which musicians are willing to be invited to the concerts to play certain instruments. Here is the link:

Anchor1 : Anchor2 Cardi-
nality
Sentences Table or column names
Instrument : Musician M:N An Instrument can be played by several Musicians

A Musician can play several Instruments

musician_skills

Note that this link has M:N cardinality. Because of that, we had to add another two-column table (also known as a junction table). It is completely independent from the “performances” table, in both ways.

First, we can record that Alice can play violin, but we don’t have any records about her playing anything. This is normal: maybe she just started, or our own dataset is far from complete.

Second, somebody can play in a concert, but they do not necessarily want to be invited to further concerts. Maybe they’ve retired and are only doing exceptions for special occasions.

Or, somebody needs to play hammer in Mahler’s Symphony No. 6 (there are only two or three blows in the entire piece). It’s probably not necessary to have a “skill” for that, but you certainly want to be mentioned in the lineup because it’s such an important part.

So, it’s a question of what you want to implement. Is it a sort of marketplace for musicians? Then you need “musician_skills”. Or, if you want to have a record of who played when — that’s what “performances” table is for. Maybe you want both. It’s up to you to decide.

Additional reading

Here are three more texts that may be useful if you want to understand more about the history of normal forms.

  1. William Kent “A Simple Guide to Five Normal Forms in Relational Database Theory” (1982)

5NF is discussed in chapter 4.2. We already used this text to better understand historical aspects of 4NF definition. There is a confusing sentence there:

“Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record.”

It seems that there are too many negatives here, and it’s hard to follow. But I won’t attempt to edit this sentence: my goal is to present a completely different way of reaching the same result.

  1. Andrei Pall “Database Normalization” (2020)

This post contains a beautiful table that shows the full range of normal forms (including non-numerical ones), explaining which specific criteria motivated each of them.

  1. Hugh Darwen “An Introduction to Relational Database Theory” (2010)

The book is freely downloadable. On page 11 you can find “Note to Teachers” which alone is hugely enlightening. It says that 4NF is just a simple special case of 5NF, which is a useful piece of spiritual guidance.

Another quote:

Also in Chapter 7, [I have aimed] to study the normal forms in reverse order to that in which they are normally presented. I put 6NF first because it is the simplest and also the most extreme. More important to me was to deal with 5NF and join dependencies before BCNF and functional dependencies (though I do leave to the end discussion of those pathological cases where BCNF is satisfied but not 5NF).

This was a big confirmation that my own approach to teaching normal forms is plausible — I also think that 6NF is the foundational form.

5NF is presented in Section 7.4 (page 185), using the “Wives of Henry VIII” example. This is not a very good example because it is rather misaligned with the typical real-world scenarios. Nevertheless, the book, as well as other materials by Darwen, is a treasure.


“Database Design Book” (2025)

Learn how to get from business requirements to a database schema

If this post was useful, you may find this book useful too.

Table of contents and sample chapters

Book length: 145 pages, ~32.000 words. Available in both PDF and in EPUB format.

Purchase for €32


Conclusion

Traditionally 5NF is presented with an unmotivated example of a 3-column table. This table is then “split” into three 2-column tables. The splitting operation is not very well motivated and does not happen in practice.

This makes 5NF harder to understand and encourages the air of mysticism around the higher normal forms.

We propose a more straightforward way, better aligned with the practice of database design. We start from the business requirements and logical schema. Based on that, two patterns of physical table design follow naturally: AB-BC-AC triangle, or an ABC+D star pattern. Both patterns, when designed this way, are fully normalized: there are no anomalies and no redundancy.

I’d be happy to hear your feedback:
Alexey Makhotkin [email protected].

联系我们 contact @ memedata.com