by wvenable 7 hours ago

by dspillett 5 hours ago

> I'm not sure why anyone would choose varchar for a column in 2026

The same string takes roughly half the storage space in a VARCHAR column, meaning more rows per page and therefore a smaller working set needed in memory for the same queries, and less IO. Any indexes on those columns will be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹, otherwise use N[VAR]CHARs.

Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.

Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages every access because they remain compressed in memory as well as on-disk.

--------

[0] Codes with fixed ranges, etc.

[1] Some would put that the other way around, and “use NVARCHAR if you think there might be any non-ASCII characters”, but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.

by gfody 57 minutes ago

utf16 is more efficient if you have non-english text, utf8 wastes space with long escape sequences. but the real reason to always use nvarchar is that it remains sargable when varchar parameters are implicitly cast to nvarchar.

by beart 7 hours ago

I agree with your first point. I've seen this same issue crop up in several other ORMs.

As to your second point: VARCHAR uses N + 2 bytes, whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store unicode values.
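
The arithmetic behind those two formulas can be sketched as follows. This is a minimal illustration that only encodes the N + 2 and N*2 + 2 figures quoted above; the function names are made up for the example, and it assumes one byte per character for VARCHAR and two bytes per character for NVARCHAR.

```python
# Storage per value, using the formulas quoted in the comment above:
#   VARCHAR  -> N + 2 bytes
#   NVARCHAR -> N*2 + 2 bytes
# Function names are hypothetical; the 2-byte term is the length overhead.

def varchar_bytes(n_chars: int) -> int:
    """Bytes for an N-character VARCHAR value (one byte per character)."""
    return n_chars + 2


def nvarchar_bytes(n_chars: int) -> int:
    """Bytes for an N-character NVARCHAR value (two bytes per character)."""
    return 2 * n_chars + 2


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(f"N={n}: varchar={varchar_bytes(n)} nvarchar={nvarchar_bytes(n)}")
```

For a 10-character code this gives 12 bytes against 22, which is where the "roughly half" figure comes from.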

by wvenable 6 hours ago

> The vast majority of character fields in databases I've worked with do not need to store unicode values.

This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.

by SigmundA 6 hours ago

Vast majority of text fields I see are coded values that are perfectly fine using ascii, but I deal mostly with English language systems.

Text fields that users can type into directly, especially multiline ones, tend to need unicode, but they are far fewer.

by psidebot 4 hours ago

Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…

by simonask 5 hours ago

English has plenty of Unicode — claiming otherwise is such a cliché…

Unicode is a requirement everywhere human language is used, from Earth to the Boötes Void.

by Slothrop99 1 hour ago

Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.
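
That claim is easy to check with Python's built-in `cp1252` codec. A small sketch, assuming the characters in question are the accented letters, the long dash, and the ellipsis from the parent comment; the helper name and sample strings are made up for the example.

```python
# Check whether a string survives a round trip through Windows-1252 (CP1252).
# fits_cp1252 is a hypothetical helper name.

def fits_cp1252(text: str) -> bool:
    """True if every character in text has a CP1252 byte."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# \u2014 is the long dash and \u2026 the ellipsis used in the parent comment.
parent_sample = "clich\u00e9 \u2014 Bo\u00f6tes\u2026"

if __name__ == "__main__":
    print(fits_cp1252(parent_sample))       # accented Latin, dash, ellipsis
    print(fits_cp1252("\u6771\u4eac"))      # two CJK characters
    print(len(parent_sample.encode("cp1252")) == len(parent_sample))
```

Every character in the sample encodes to exactly one byte, while the CJK string has no CP1252 representation at all.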

Not that I disagree: Win32/C#/Java/etc. have 16-bit characters, so your entire system is already 'paying the price', and it's weird to get frugal here.

by zabzonk 1 hour ago

> Unicode is a requirement everywhere human language is used

Strange then how it was not a requirement for many, many years.

by NegativeLatency 5 hours ago

Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji

by SigmundA 3 hours ago

Specifically not talking about a name field

by SigmundA 3 hours ago

I am talking about coded values, like Status = 'A', 'B' or 'C'

Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, in English language systems at least. They just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can have unicode.

by wvenable 1 hour ago

It's way better to just use a DBMS that supports enums. I know SQL Server isn't one of those but I still don't store my coded values as strings.

by kstrauser 1 hour ago

Those are all single byte characters in UTF-8.

by _3u10 7 hours ago

Generally if it stores user input it needs to support Unicode. That said UTF-8 is probably a way better choice than UTF-16/UCS-2

by Dwedit 1 hour ago

The one place UTF-16 massively wins is text that would be two bytes as UTF-16, but three bytes as UTF-8. That's mainly Chinese, Japanese, Korean, etc...
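
The size trade-off can be measured directly. A minimal sketch using Python's standard codecs; the sample strings and the helper name are arbitrary choices for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# encoded_sizes is a hypothetical helper; samples are arbitrary.

def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ascii code": "status",
    "accented Latin": "caf\u00e9",
    "CJK": "\u6570\u636e\u5e93",  # three BMP characters
}

if __name__ == "__main__":
    for label, text in samples.items():
        utf8, utf16 = encoded_sizes(text)
        print(f"{label}: utf-8={utf8} bytes, utf-16={utf16} bytes")
```

ASCII doubles in UTF-16 (6 bytes against 12), while the three CJK characters take 9 bytes in UTF-8 but only 6 in UTF-16.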

by SigmundA 6 hours ago

UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

I have avoided it and have not followed whether the issues are fully resolved; I would hope they are.

by kstrauser 6 hours ago

> UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.

by da_chicken 2 hours ago

I would be more critical of Microsoft choosing to support UCS-2/UTF-16 if Microsoft hadn't completed their implementation of Unicode support in the 90s and then been pretty consistent with it.

Meanwhile Linux had a years long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.
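
The shebang point can be seen at the byte level. A minimal sketch, assuming the loader looks for the two bytes `#!` at the very start of the file; the script text is made up for the example, and `utf-8-sig` is Python's codec that prepends the UTF-8 BOM.

```python
import codecs

# A made-up two-line shell script, encoded with and without a UTF-8 BOM.
script = "#!/bin/sh\necho hi\n"

plain = script.encode("utf-8")         # starts with b"#!"
with_bom = script.encode("utf-8-sig")  # starts with EF BB BF, then b"#!"

if __name__ == "__main__":
    print(plain[:2])
    print(with_bom[:5])
    print(with_bom[:3] == codecs.BOM_UTF8)
```

With the BOM present, the first two bytes are no longer `#!`, so a check on the leading bytes does not recognise the interpreter line.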

by recursive 3 hours ago

In 92 it was a conference talk. In 98 it was adopted by the IETF. Point probably stands though.

by swasheck 3 hours ago

the data types were introduced with SQL Server 7 (1998) so i’m not sure it’s accurate to state that it’s considered as the new thing.

by SigmundA 2 hours ago

UTF-8 was introduced in SQL Server 2019:

https://learn.microsoft.com/en-us/sql/sql-server/what-s-new-...

by SigmundA 6 hours ago

To complicate matters SQL Server can do Nvarchar compression, but they should have just done UTF-8 long ago:

https://learn.microsoft.com/en-us/sql/relational-databases/d...

Also UTF-8 is actually just a varchar collation so you don't use nvarchar with that, lol?

by applfanboysbgon 6 hours ago

I think this is a rather pertinent showcase of the danger of outsourcing your thinking to LLMs. This article strongly indicates to me that it is LLM-written, and it's likely the LLM diagnosed the issue as being a C# issue. When you don't understand the systems you're building with, all you can do is take the plausible-sounding generated text about what went wrong for granted, and then I suppose regurgitate it on your LLM-generated portfolio website in an ostensible show of your profound architectural knowledge.

by ziml77 5 hours ago

This is not at all just an LLM thing. I've been working with C# and MS SQL Server for many years and never even considered this could be happening when I use Dapper. There's likely code I have deployed running suboptimally because of this.

And it's not like I don't care about performance. If I see a small query taking more than a fraction of a second when testing in SSMS, or if I see a larger query taking more than a few seconds, I will dig into the query plan and try to make changes to improve it. For code that I took from testing in SSMS and moved into a Dapper query, I wouldn't have noticed performance issues from that move if the slowdown was never particularly large.

by cosmez 6 hours ago

This is a common issue, and most developers I worked with are not aware of it until they see the performance issues.

Most people are not aware of how Dapper maps types under the hood; once you know, you start being careful about it.

Nothing to do with LLMs, just plain old learning through mistakes.

by keithnz 5 hours ago

Actually, LLMs do way better: with Dapper, the LLM generates code that specifies types for strings.

by SigmundA 6 hours ago

Yes I have run into this regardless of client language and I consider it a defect in the optimizer.

by wvenable 6 hours ago

I wouldn't consider it a defect in the optimizer; it's doing exactly what it's told to do. It cannot convert an nvarchar to varchar -- that's a narrowing conversion. All it can do is convert the other way and lose the ability to use the index. If you think that there is no danger converting an nvarchar that contains only ASCII to varchar then I have about 70+ different collations that say otherwise.
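
Why a narrowing conversion is unsafe can be shown outside the database. This is a rough analogy only, not SQL Server's actual conversion code: it assumes a single-byte code page (CP1252 here) and that unmappable characters are replaced with `?`; the helper name and sample values are made up.

```python
# Rough analogy for a lossy nvarchar -> varchar conversion.
# Assumption: characters missing from the code page become "?".

def narrow(value: str, code_page: str = "cp1252") -> str:
    """Squeeze a Unicode string into a single-byte code page, lossily."""
    return value.encode(code_page, errors="replace").decode(code_page)


tokyo = "\u6771\u4eac"
osaka = "\u5927\u962a"

if __name__ == "__main__":
    print(narrow("ABC"))                    # ASCII passes through unchanged
    print(narrow(tokyo), narrow(osaka))     # both collapse to the same text
    print(narrow(tokyo) == narrow(osaka))   # distinct inputs now compare equal
```

Two different parameter values become the same narrowed value, so a comparison done after narrowing could match rows it should not. Widening the column side instead never loses information, at the cost of the index seek.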

by SigmundA 3 hours ago

Can you give an example of what's dangerous about converting an nvarchar with only ASCII (0-127) and then using the index, otherwise falling back to a scan?

If we simply went to UTF-8 collation using varchar then this wouldn't be an issue either, which is why you would use varchar in 2026, best of both worlds so to speak.

by wvenable 56 minutes ago

For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By telling SQL Server the parameter is an nvarchar value, you're the one telling it that the value might not be ASCII.

by paulsutter 5 hours ago

Utf8 solved this completely. It works with any length unicode and on average takes up almost as little storage as ascii.

Utf16 is brain dead and an embarrassment

by Dwedit 1 hour ago

It gets worse for UTF-16, Windows will let you name files using unpaired surrogates, now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor compliant UTF-16 for that matter). Because of that, there's yet another encoding called WTF-8 that can represent the arbitrary invalid 16-bit values.
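
The unpaired-surrogate problem is reproducible in Python, whose strings may also hold lone surrogates. A small sketch: `surrogatepass` is Python's own error handler rather than a WTF-8 implementation, but for a lone surrogate it produces the same generalized three-byte form.

```python
# A lone high surrogate: legal as a 16-bit value, not a valid Unicode scalar.
lone = "\ud800"


def strict_utf8_ok(text: str) -> bool:
    """True if text can be encoded as strict, compliant UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Generalized encoding that lets the surrogate through as three bytes.
generalized = lone.encode("utf-8", errors="surrogatepass")

if __name__ == "__main__":
    print(strict_utf8_ok(lone))
    print(generalized)
```

Strict UTF-8 rejects the value, while the permissive handler emits `ED A0 80`, bytes that a compliant UTF-8 decoder will in turn refuse.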

by wvenable 5 hours ago

Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65,536 code points would be enough for everyone.
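
For reference, a 16-bit code unit can address 2^16 values, and anything beyond that range needs two UTF-16 code units (a surrogate pair). A minimal sketch; the helper name and sample characters are arbitrary.

```python
# Number of values a single 16-bit code unit can address.
SIXTEEN_BIT_VALUES = 2 ** 16


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"  # a code point above U+FFFF

if __name__ == "__main__":
    print(SIXTEEN_BIT_VALUES)
    print(utf16_code_units("A"), utf16_code_units(emoji))
    print(ord(emoji) > 0xFFFF)
```

The emoji lies outside the 16-bit range and occupies two code units, which is the case the original fixed-width 16-bit design did not anticipate.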

So many problems could be solved with a time machine.

by kstrauser 4 hours ago

The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000.

The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".

by wvenable 10 minutes ago

I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 were well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

It's now 2026; everything always looks different in hindsight.

by gpvos 1 hour ago

MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.

by kstrauser 1 hour ago

Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway.
