Columnar Storage Is Normalization

原始链接: https://buttondown.com/jaffray/archive/columnar-storage-is-normalization/

## Row vs. Columnar Data: A Deep Dive

The shift from row-oriented to column-oriented storage is not a departure from relational database principles, but a different encoding *within* that model. Row-oriented databases store data as complete records (rows), which makes adding new rows and retrieving whole records efficient. Analyzing a specific column, however (say, computing a colour histogram), requires reading unnecessary data.

Columnar databases invert this. Each column's data is stored together, which optimizes analytical queries focused on particular attributes. Reading a single column is fast, but reconstructing a complete row means gathering data from multiple locations, so row retrieval is slower.

This can be viewed as an extreme form of database normalization. Columnar storage resembles not one wide table but many narrow tables, joined by an implicit primary key (the data's position). In essence, "reconstructing a row" *is* a join.

This perspective highlights the role data-format manipulation plays behind common query operations such as projections and joins. Though usually an implementation detail hidden behind the query layer, recognizing columnar storage as a particular data encoding offers a powerful mental model for database performance and optimization.

A recent Hacker News thread discussed a thought-provoking idea: **columnar storage can be seen as a form of normalization.** The original article's author argues that columnar formats often reduce the need to index individual values, relying instead on the ordering of data within a column, a practice that may run counter to traditional normalization principles.

Commenters, however, pointed out a key distinction: **normalization is a *logical* data-design concept, while columnar storage is a *physical* implementation.** The analogy aids understanding, but equating the two can be misleading.

The conversation also touched on the historical context of normalization, which was originally driven by storage-capacity constraints, and on how storage efficiency can be a persuasive argument for developers. One commenter half-jokingly asked whether the idea relates to sixth normal form.

## Original Article

Something I didn't understand for a while is that the process of turning row-oriented data into column-oriented data isn't a totally bespoke, foreign concept in the realm of databases. It's still part of the relational abstraction. Or it can be.

As an example, say we have this data:

data = [
    { "name": "Smudge", "colour": "black" },
    { "name": "Sissel", "colour": "grey" },
    { "name": "Hamlet", "colour": "black" }
]

Let's assume this is a table in a relational database, and that we'd have to do all sorts of disk access to get at any particular part of the data. This representation has some nice properties.

It's easy to add a new row: we can just construct a row:

{ "name": "Petee", "colour": "black" }

and add it to the end of our already-existing list. On disk, we probably only have to touch a couple pages to do it. And if our row were really wide, in that it had a whole bunch of columns, that wouldn't really change. It would still have that nice property.

This is also true of looking up a row. Since all of a row's columns are stored next to each other, it's very fast to just pull that row out from wherever it's stored.

Conversely, if we were to want to, say, compute a histogram of the different pet colours, we have to read quite a lot of data we don't care about in order to do so.
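To make that cost concrete, here's a minimal Python sketch (mine, not from the original post) of the histogram computation over the row-oriented data. Note that we have to walk every complete row even though we only care about one field:

```python
from collections import Counter

# Row-oriented data: a list of complete records.
data = [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]

# Building the colour histogram iterates over every full row,
# even though only the "colour" field matters.
histogram = Counter(row["colour"] for row in data)
print(histogram)  # Counter({'black': 2, 'grey': 1})
```

On disk, "touching every row" means reading every page of the table, names and all, just to tally colours.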

This is a row-oriented representation of the data. A column-oriented representation would look something like this:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

This has all the opposite tradeoffs of the row-oriented design: if we only care about colour, we can very effectively read only that data. We don't have to read the names at all. But modifying the data, or reading a specific row, becomes harder: we have to jump all over the place to do either. If we want the second row, we have to go to the second index in each column to reconstruct the original row.
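A small sketch of both access patterns against the columnar layout (the `get_row` helper is my own illustration, not part of the post):

```python
# Column-oriented data: one array per column.
data = {
    "name": ["Smudge", "Sissel", "Hamlet"],
    "colour": ["black", "grey", "black"],
}

# Reading one column touches only that column's array.
colours = data["colour"]

# Reconstructing row i means visiting every column at index i.
def get_row(data, i):
    return {col: values[i] for col, values in data.items()}

print(get_row(data, 1))  # {'name': 'Sissel', 'colour': 'grey'}
```

The column read is one contiguous scan; the row read fans out to as many locations as there are columns.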

So, one way to think about this shaping of the data is that it lives at the encoding level. That it sits at a level of abstraction firmly beneath that of the data model: a SQL engine on top of it can't distinguish between the two, except via the performance characteristics of various queries.

A different way to think about columnarization like this is that it's akin to a very extreme form of database normalization.

Instead of one wide table that's represented by a bunch of vectors of data, you might think of columnar data as a set of tables which all have a primary key plus one additional attribute:

Denormalized table:

+----+------+-----+
| id | name | age |
+----+------+-----+
| 12 | Bob  |  30 |
| 93 | Tom  |  35 |
| 27 | Kim  |  28 |
+----+------+-----+

Normalized tables:

Name

+----+------+
| id | name |
+----+------+
| 12 | Bob  |
| 93 | Tom  |
| 27 | Kim  |
+----+------+

Age

+----+-----+
| id | age |
+----+-----+
| 12 |  30 |
| 93 |  35 |
| 27 |  28 |
+----+-----+

We can easily reconstruct the original table with a join on the id column.
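In Python terms, that join might look like this (the in-memory dicts and names here are illustrative, assuming `id` is unique in each table):

```python
# Hypothetical in-memory versions of the two normalized tables,
# keyed by id.
name_table = {12: "Bob", 93: "Tom", 27: "Kim"}
age_table = {12: 30, 93: 35, 27: 28}

# An inner join on id reconstructs the denormalized rows.
joined = [
    {"id": i, "name": name_table[i], "age": age_table[i]}
    for i in name_table
    if i in age_table
]
```

In SQL this would simply be `SELECT n.id, n.name, a.age FROM name n JOIN age a ON n.id = a.id`.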

In the context of a columnar-stored table, you can think of the primary key as the ordinal position of a given piece of data.

Our original data:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Looks like this:

+----+--------+
| id |   name |
+----+--------+
|  0 | Smudge |
|  1 | Sissel |
|  2 | Hamlet |
+----+--------+

+----+--------+
| id | colour |
+----+--------+
|  0 |  black |
|  1 |   grey |
|  2 |  black |
+----+--------+

However, the id column is just implied by the position in the arrays:

+--------+
|   name |
+--------+
| Smudge |
| Sissel |
| Hamlet |
+--------+

+--------+
| colour |
+--------+
|  black |
|   grey |
|  black |
+--------+

I think the value of this perspective is that it unifies a lot of traditional query-processing operations, like projections and joins, with manipulation of data formats. Some, many, most times, you probably should think about data formats like this as an implementation detail that queries are logically blind to, but it's a useful mental model to realize that "reconstructing a row from columnar storage" doesn't just look like performing a join, it is a join.
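That claim can be demonstrated directly: materialize the implicit id column with `enumerate`, then join the per-column tables on equal positions. (A sketch of the idea, not a real storage engine.)

```python
data = {
    "name": ["Smudge", "Sissel", "Hamlet"],
    "colour": ["black", "grey", "black"],
}

# Materialize the implicit id column: each value's position.
name_table = list(enumerate(data["name"]))      # [(0, 'Smudge'), ...]
colour_table = list(enumerate(data["colour"]))  # [(0, 'black'), ...]

# "Reconstructing the rows" is an equi-join on the implicit key;
# zip lines up equal positions, and the predicate makes it explicit.
rows = [
    {"name": n, "colour": c}
    for (i, n), (j, c) in zip(name_table, colour_table)
    if i == j  # the join predicate: matching implicit keys
]
```

Because both columns are stored in the same order, the join degenerates into a positional merge, which is exactly why columnar engines can reconstruct rows without any index lookups.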
