5NF and Database Design

upvote

5NF and Database Design

(kb.databasedesignbook.com)

108 points

by petalmind6 hours ago |

upvote

by jerf4 hours ago|

[-]

In a roundabout way this article captures well why I don't really like thinking in terms of "normal forms", especially as a numbered list like that. The key insights are really 1. Avoid redundancy and 2. This may involve synthesizing relationships that don't immediately obviously exist from a human perspective. Both of those can be expanded on at quite some length, but I never found much value in the supposedly-blessed intermediate points represented by the nominally numbered "forms". I don't find them useful either for thinking about the problem or for communicating about it.

Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM) doesn't mean it is actually useful... sometimes it just means that it made it easy to write multiple choice test questions. (e.g., "What does Layer 2 of the OSI network model represent? A: ... B: ... C: ... D: ..." to which the most appropriate real-world answer is "Who cares?")

reply

upvote

by petalmind4 hours ago|

[-]

> Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM)

One problem is that normal forms are underspecified even by the academy.

E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).

2NF and 3NF should basically be merged into BCNF, if I understand correctly, and treated like a general case (as per Darwen).

Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).

Also, personally I think that 6NF should be foundational, but that's a separate matter.

reply

upvote

by jerf4 hours ago|

[-]

"1979 (I was one year old then)."

Well, we are roughly the same age then. Our is a cynical generation.

"One problem is that normal forms are underspecified even by the academy."

The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waiving around normalization forms was a good gig for consultants in the 1980 but I bet even then the real practitioners had a skeptical, arm's length relationship with them.

reply

upvote

by awesome_dude27 minutes ago|

[-]

The levels do the most important thing in computer science, give discrete and meaningful levels to talk/argue about at the watercolour

reply

upvote

by wolttam4 hours ago|

[-]

Why shouldn’t we care about layer 2? You can do really fun and interesting things at the MAC layer.

reply

upvote

by jerf3 hours ago|

[-]

You can do what you do at the MAC layer without any regard for whether or not it is "OSI layer 2", or whether your MAC layer "cheats" and has features that extend into layers 1, or 3, or any other layer. Failing to implement something useful because "that's not what OSI layer 2 is and this is data layer 2 and the OSI model says not to do that" is silly.

To stay on the main topic, same for the "normalization forms". Do what your database needs.

The concepts are just attractive nuisances. They are more likely to hurt someone than to help them.

reply

upvote

by minkeymaniac3 hours ago|

[-]

Normalize till it hurts, then denormalize till it works!

reply

upvote

by Quarrelsome3 hours ago|

[-]

what a marvelous motto <3.

Certainly a lot more concise than the article or the works the article references.

reply

upvote

by petalmind2 hours ago|

[-]

Imperative mood "normalize" assumes that you had something not-normalized before you received that instruction. It's not useful when your table design strategy is already normalization-preserving, such as the most basic textbook strategy (a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link).

And this is basically the main point of my critique of 4NF and 5NF. They both traditionally present an unexplained table that is supposed to be normalized. But it's not clear where does this original structure come from. Why are its own authors not aware about the (arguably, quite simple) concept of normalization?

It's like saying that to in order to implement an algorithm you have to remove bugs from its original implementation — where does this implementation come from?

The other side of this coin is that lots of real-world design have a lot of denormalized representations that are often reasonably-well engineered.

Because of that if you, as a novice, look at a typical production schema, and you have this "thou shalt normalize" instruction, you'll be confused.

This is my big teaching pet peeve.

reply

upvote

by Quarrelsome2 hours ago|

[-]

> But it's not clear where does this original structure come from. Why are its own authors not aware about the (arguably, quite simple) concept of normalization?

I find the bafflement expressed in the article as well as the one linked extremely attractive. It made both a joy to read.

Were I to hazard a guess: Might it be a consequence of lack of disk space in those early decades, resulting into developers being cautious about defining new tables and failing to rationalise that the duplication in their tragic designs would result in more space wasted?

> The other side of this coin is that lots of real-world design have a lot of denormalized representations that are often reasonably-well engineered.

Agreed, but as the OP comment stated they usually started out normalised and then pushed out denormalised representations for nice contiguous reads.

As a victim of maintaining a stack on top of an EAV schema once upon a time, I have great appreciation for contiguous reads.

reply

upvote

by petalmind2 hours ago|

[-]

> Might it be a consequence of lack of disk space in those early decades

A plausible explanation of "normalization as a process" was actually found in https://www.cargocultcode.com/normalization-is-not-a-process... ("So where did it begin?").

I hope someday to find some technical report of migrating to the relational database, from around that time.

reply

upvote

by Quarrelsome2 hours ago|

[-]

> Normalization-as-process makes sense in a specific scenario: When converting a hierarchical database model into a relational model.

That makes much more sense as reasoning.

If I can also offer a second hazard of guess. I used to work in embedded in the 2000's and it was absolutely insane how almost all of the eldy architects and developers would readily accept some fixed width file format for data storage over a sensible solution that offered out of the box transactionality and relational modelling like Sqlite. This creates a mindset where each datastore is effectively siloed and must contain all the information to perform the operation, potentially leading to these denormalised designs.

Bit weird, given that was from the waterfall era, implying that the "Big Design Up Front" wasn't actually doing any real thinking about modelling up front. But I've been in that room and I think a lot of it was cargo cult. To deal with the insanity of simple file I/O as data, I had to write a rudimentary atomicity system from scratch in order to fix the dumb corruption issues of their design when I would have got that for free with Sqlite.

reply

upvote

by Quarrelsome3 hours ago|

[-]

Especially loved the article linked that was dissing down formal definitions of 4NF.

reply

upvote

by iFire4 hours ago|

[-]

https://en.wikipedia.org/wiki/Essential_tuple_normal_form is cool!

Since I had bad memory, I asked the ai to make me a mnemonic:

* Every

* Table

* Needs

* Full-keys (in its joins)

reply

upvote

by petalmind2 hours ago|

[-]

I have so many questions about that. Should that normal form basically replace 5NF for the purposes of teaching?

Why do they hate us and do not provide any illustrative real-life example without using algebraic notation? Is it even possible?

I just want to see a CREATE TABLE statement, and some illustrative SELECT statements. The standard examples always give just the dataset, but dataset examples are often ambiguous.

> (in its joins)

Do you understand what are "its" joins? What is even "it" here.

I'm super frustrated. This paper is 14 years old.

reply

upvote

by iFire2 hours ago|

[-]

https://dl.acm.org/doi/10.1145/2274576.2274589

I'll try reading it again.

reply

upvote

by iFire2 hours ago|

[-]

Chris Date has a course on this using his parts and supplies example. Don't have time to find it but maybe ai can find it.

https://www.oreilly.com/videos/c-j-dates-database/9781449336...

https://www.amazon.ca/Database-Design-Relational-Theory-Norm...

reply

upvote

by cremer2 hours ago|

[-]

The numbered forms are most useful as a teaching device, not an engineering specification. Once you have internalized 2NF and 3NF violations through a few painful bugs, you start spotting partial and transitive dependencies by feel rather than by running through definitions. The forms gave you the vocabulary. The bugs gave you the instinct..

reply

upvote

by estetlinus5 hours ago|

[-]

The lost art of normalizing databases. ”Why is the ARR so high on client X? Oh, we’re counting it 11 times lol”.

I would maybe throw in date as an key too. Bad idea?

reply

upvote

by hilariously1 hours ago|

[-]

It depends on if you are doing OLTP (granular, transactional) vs OLAP (fact/date based aggregates) - dates are generally not something you'd consider in a fully normalized flow to uniqify records.

reply

upvote

by petalmind5 hours ago|

[-]

Frankly I don't think that overcounting is solved by normalizing, because it's easy to write an overcounting SQL query over perfectly normalized data.

I tried to explain the real cause of overcounting in my "Modern Guide to SQL JOINs":

https://kb.databasedesignbook.com/posts/sql-joins/#understan...

reply

upvote

by akdev1l2 hours ago|

[-]

My brain has been blunted too far due to dynamodb and NoSQL storage usage and now I can’t even normalize anymore

reply

upvote

by tadfisher6 hours ago|

[-]

I love reading about the normal forms, because it makes me sound like I know what I'm talking about in the conversation where the backend folks tell me, "if we normalized that data then the database would go down". This is usually followed by arguments over UUID versions for some reason.

reply

upvote

by necovek5 hours ago|

[-]

So which normal form do they argue for and against? And what UUID version wins the argument?

reply

upvote

by tadfisher5 hours ago|

[-]

Explaining jokes is poor form.

reply

upvote

by necovek3 hours ago|

[-]

This was an attempt to extend jokes and not ask for explanation: there are a number of normal forms, and people usually talk about "normalization" without being specific thus conflating all of them; out of 7 UUID versions, only 2 generally make sense for use today depending on whether you need time-incrementing version or not.

reply

upvote

by culi5 hours ago|

[-]

On the internet it is normal.

reply

upvote

by Tostino5 hours ago|

[-]

Not OP, but UUID v7 is what you want for most database workloads (other than something like Spanner)

reply

upvote

by tossandthrow3 hours ago|

[-]

I use the null uuid as primary key - never had any DB scaling issues.

reply

upvote

by petalmind3 hours ago|

[-]

Yeah, no NULL is ever equal to any other NULL, so they are basically unique.

reply

upvote

by Groxx3 hours ago|

[-]

You are also guaranteed to be able to retrieve your data, just query for '... is null'. No complicated logic needed!

reply

upvote

by RedShift13 hours ago|

[-]

Me still using bigints... Which haven't given me any problems. Wouldn't use it for client generated IDs but that is not what most applications require anyway.

reply

upvote

by carlyai4 hours ago|

[-]

love this

reply

upvote

by DeathArrow5 hours ago|

[-]

There are use cases where is better to not normalize the data.

reply

upvote

by andrew_lettuce5 hours ago|

[-]

Typically it's better to take normalized data and denormalize for your use case vs. not normalize in the first place. Really depends on your needs

reply

upvote

by jghn5 hours ago|

[-]

Over time I’ve developed a philosophy of starting roughly around 3NF and adjusting as the project evolves. Usually this means some parts of the db get demoralize and some get further normalized

reply

upvote

by skeeter20204 hours ago|

[-]

>> Usually this means some parts of the db get demoralize

I largely agree with your practical approach, but try and keep the data excited about the process, sell the "new use cases for the same data!" angle :)

reply

upvote

by abirch4 hours ago|

[-]

I'm a fan of the sushi principle: raw data is better than cooked data.

Each process should take data from a golden source and not a pre-aggregated or overly normalized non-authorative source.

reply

upvote

by layer84 hours ago|

[-]

Sometimes the role of your system is to be the authoritative source of data that it has aggregated, validated, and canonicalized.

reply

upvote

by abirch4 hours ago|

[-]

This is great. Then I would consider the aggreated, validated, and canonicalized source as a Golden Source. Where I've seen issues is that someone starts to query from a nonauthoritative source because they know about it, instead of going upstream to a proper source.

reply

upvote

by petalmind5 hours ago|

[-]

One day I hope to write about denormalization, explained explicitly via JOINs.

reply

upvote

by andrii3 hours ago|

[-]

Please do, you content is great!

reply

upvote

by bob10294 hours ago|

[-]

JSON is extremely fast these days. Gzipped JSON perhaps even more so.

I find that JSON blobs up to about 1 megabyte are very reasonable in most scenarios. You are looking at maybe a millisecond of latency overhead in exchange for much denser I/O for complex objects. If the system is very write-intensive, I would cap the blobs around 10-100kb.

reply

upvote

by sgarland14 minutes ago|

[-]

> You are looking at maybe a millisecond of latency overhead [for 1 megabyte]

Considering the data transfer alone for 1 MB / 1 sec requires 8 Gbps, I have doubts. But for fun, I created a small table in Postgres 18 with an INT PK, and a few thousand JSONB blobs of various sizes, up to 1 MiB. Median timing was 4.7 msec for a simple point select, compared to 0.1 msec (blobs of 3 KiB), and 0.8 msec (blobs of 64 KiB). This was on a MBP M4 Pro, using Python with psycopg, so latency is quite low.

The TOAST/de-TOAST overhead is going to kill you for any blobs > 2 KiB (by default, adjustable). And for larger blobs, especially in cloud solutions where the disk is almost always attached over a network, the sheer number of pages you have to fetch (a 1 MiB blob will nominally consume 128 pages, modulo compression, row overhead, etc.) will add significant latency. All of this will also add pressure to actually useful pages that may be cached, so queries to more reasonable tables will be impacted as well.

RDBMS should not be used to store blobs; it's not a filesystem.

reply

upvote

by Quarrelsome3 hours ago|

[-]

I adore contiguous reads that ideas like that yield. I'd rather push that out to a read-only end point, then getting sucked into the entropy of treating what is effectively an unschema-ed blob into editable data.

reply