Parse, Don't Validate (2019)

(lexi-lambda.github.io)

218 points

by shirian 8 hours ago

by seanwilson 7 hours ago

Maybe I'm missing something and I'm glad this idea resonates, but it feels like sometime after Java got popular and dynamic languages got a lot of mindshare, a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.

In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around. You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date (Edit: Changed this from email because email validation is a can of worms as an example). So there, "parse, don't validate" is the norm and not a tip/idea that would need to gain traction.
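A minimal sketch of the Date example above, in Python. The names `IsoDate` and `days_until` are made up for illustration, and Python itself will not stop someone from calling the constructor directly, so treating `parse` as the only entry point is a convention here rather than a guarantee.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IsoDate:
    """A date built from a string that date.fromisoformat accepts."""
    value: date

    @classmethod
    def parse(cls, raw: str) -> "IsoDate":
        # date.fromisoformat raises ValueError on malformed or impossible
        # dates, so a bad string never becomes an IsoDate.
        return cls(date.fromisoformat(raw))


def days_until(deadline: IsoDate, today: IsoDate) -> int:
    # No defensive checks here: both arguments were parsed at the boundary.
    return (deadline.value - today.value).days
```

Code downstream of `parse` can take `IsoDate` arguments and skip the "is this string really a date?" checks.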

by pjerem 7 hours ago

> In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around.

In 99% of the projects I've worked on in my professional life, anything coming from human input is manipulated as a string, and most of the time it stays that way through all of the application layers (with more or fewer checks along the way).

On your precise example, I can even say that I never saw something like an "Email object".

by jghn 6 hours ago

I've seen a mix between stringly typed apps and strongly typed apps. The strongly typed apps had an upfront cost but were much better to work with in the long run. Define types for things like names, email address, age, and the like. Convert the strings to the appropriate type on ingest, and then inside your system only use the correct types.

by hathawsh 3 hours ago

Python has an "email object" that you should definitely use if you're going to parse email messages in any way.

https://docs.python.org/3/library/email.message.html

I imagine other languages have similar libraries. I would say static typing in scripting languages has arrived and is here to stay. It's a huge benefit for large code bases.

by FranklinJabar 6 hours ago

> On your precise exemple, I can even say that I never saw something like an "Email object".

Well that's.... absolutely horrifying. Would you mind sharing what industry/stack you work with?

by Terr_ 4 hours ago

> horrifying.

IMO it's worth distinguishing between different points on the spectrum of "email object", ex:

1. Here is an Email object with detailed properties or methods for accessing its individual portions, changing things to/from canonical forms (e.g. lowercase Punycode domain names), running standard (or nonstandard) comparisons, etc.

2. Here is an immutable Email object which mainly wraps an arbitrary string, so that it isn't easily mixed up with other notable strings we have everywhere.

__________

For e-mails in particular, implementing the first is a nightmare--I know this well from recent tasks fixing bad/subjective validation rules. Even if you follow every spec with inhuman precision and cleverness, you'll get something nobody will like.

In contrast, the second provides a lot of bang for your buck. It doesn't guarantee every Email is valid, but you get much better tools for tracing flows, finding where bad values might be coming from, and for implementing future validation/comparison rules (which might be context-specific) later when you decide you need to invest in them.
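The second option could look roughly like this in Python. `Email`, `Username` and `send_welcome` are hypothetical names, and the wrapper deliberately makes no claim that the address is valid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Opaque wrapper: says 'this is meant to be an email', nothing more."""
    raw: str


@dataclass(frozen=True)
class Username:
    raw: str


def send_welcome(to: Email, name: Username) -> str:
    # A static checker such as mypy would reject swapped arguments;
    # the isinstance checks catch the same mistake at runtime.
    if not isinstance(to, Email) or not isinstance(name, Username):
        raise TypeError("arguments mixed up")
    return f"To: {to.raw}\nHello, {name.raw}!"
```

Searching the codebase for `Email(` then shows every place a raw string enters the system as an address, which is the tracing benefit described above.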

by squeaky-clean 5 hours ago

The easiest and most robust way to deal with email is to have 2 fields. string email, bool isValidated. (And you'll need some additional way to handle a time based validation code). Accept the user's string, fire off an email to it and require them to click a validation link or enter a code somewhere.

Email is weird and ultimately the only decider of a valid email is "can I send email to this address and get confirmation of receipt".

If it's a consumer website you can do some client-side validation of ".@.\\..*" to catch easy typos. That will end up rejecting a very small number of users, but they can usually deal with it. Validating against known good email domains and whatnot will just create a mess.

by lock1 5 hours ago

In the spirit of "Parse, Don't Validate", rather than encode "validation" information as a boolean to be checked at runtime, you can define `Email { raw: String }` and hide the constructor behind a "factory function" that accepts any string but returns `Option<Email>` or `Result<Email,ParseError>`.

If you need a stronger guarantee than just a "string that passes simple email regex", create another "newtype" that parses the `Email` type further into `ValidatedEmail { raw: String, validationTime: DateTime }`.

While it does add some "boilerplate-y" code no matter what kind of syntactical sugar is available in the language of your choice, this approach utilizes the type system to enforce the "pass only non-malformed & working email" rule when `ValidatedEmail` type pops up without constantly remembering to check `email.isValidated`.

This approach's benefit varies depending on programming languages and what you are trying to do. Some languages offer 0-runtime cost, like Haskell's `newtype` or Rust's `repr(transparent)`, others carry non-negligible runtime overhead. Even then, it depends on whether the overhead is acceptable or not in exchange for "correctness".
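A rough Python rendering of the two-step idea, with `None` standing in for `Option`. The regex, the `confirm` function and its `clicked_link` flag are assumptions made up for this sketch; a real confirmation flow would involve sending a message and checking a code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
import re

# Deliberately loose shape check: something@something.something
_EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class Email:
    raw: str


@dataclass(frozen=True)
class ValidatedEmail:
    raw: str
    validation_time: datetime


def parse_email(text: str) -> Optional[Email]:
    """Intended as the only way to obtain an Email."""
    return Email(text) if _EMAIL_SHAPE.match(text) else None


def confirm(email: Email, clicked_link: bool) -> Optional[ValidatedEmail]:
    """Upgrade to ValidatedEmail only once the owner confirmed receipt."""
    if not clicked_link:
        return None
    return ValidatedEmail(email.raw, datetime.now(timezone.utc))


def send_newsletter(to: ValidatedEmail) -> str:
    # The signature carries the guarantee; there is no flag to remember.
    return f"sending to {to.raw}"
```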

by squeaky-clean 2 hours ago

I would still usually prefer email as just a string and validation as a separate property, and they both belong to some other object. Unless you really only want to know if XYZ email exists, it's usually something more like "has it been validated that ABC user can receive email at XYZ address".

Is the user account validated? Send an email to their email string. Is it not validated? Then why are we even at a point in the code where we're considering emailing the user, except to validate the email.

You can use similar logic to what you described, but instead with something like User and ValidatedUser. I just don't think there's much benefit to doing it with specifically the email field and turning email into an object. Because in those examples you can have a User whose email property is a ParseError and you still end up having to check "is the email property result for this user type Email or type ParseError?" and it's very similar to just checking a validation bool except it's hiding what's actually going on.

by mejutoco 4 hours ago

My preferred solution would be:

You have 2 types

UnvalidatedEmail

ValidatedEmail

Then ValidatedEmail is only created in the function that does the validation: a function that takes an UnvalidatedEmail and returns a ValidatedEmail or an error object.

by squeaky-clean 3 hours ago

That can work in some situations. One thing I won't like about it in some other situations is that you now have 2 nullable fields associated with your user, or whatever that email is associated with. It's annoying or even impossible in a lot of systems to have a guaranteed validation that user.UnvalidatedEmail or user.ValidatedEmail must exist but not both.

by mejutoco 2 hours ago

I see. In my example they would be just types and internally a newtype string.

So an object could have a field

email: UnvalidatedEmail | ValidatedEmail

Nothing would be nullable there in that case. You could match on the type and break if not all cases are handled.
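Sketched in Python, the single union-typed field might look like this. `User` and `next_action` are hypothetical; a static checker can verify that every branch is covered, and the final `raise` is a runtime backstop.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class UnvalidatedEmail:
    raw: str


@dataclass(frozen=True)
class ValidatedEmail:
    raw: str


@dataclass(frozen=True)
class User:
    name: str
    # Exactly one of the two types is always present; nothing is nullable.
    email: Union[UnvalidatedEmail, ValidatedEmail]


def next_action(user: User) -> str:
    email = user.email
    if isinstance(email, ValidatedEmail):
        return f"send newsletter to {email.raw}"
    if isinstance(email, UnvalidatedEmail):
        return f"send confirmation link to {email.raw}"
    # Only reachable if a new case is added and not handled here.
    raise TypeError(f"unhandled email type: {type(email).__name__}")
```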

by cogman10 6 hours ago

I've seen some devs prefer that route of programming and it very often results in performance problems.

An undiscussed issue with "everything is a string or dictionary" is that strings and dictionaries both consume very large amounts of memory. Particularly in a language like Java.

A Java object which has 2 fields in it with an int and a long will spend most of its memory on the object header. You end up with an object that has 12 bytes of payload and 32 bytes of object header (Valhalla can't come soon enough). But when you talk about a HashMap in Java, just the map structure itself ends up blowing way past that. The added overhead of 2 Strings for each of the fields plus a Java `Long` and `Integer` just decimates that memory requirement. It's even worse if someone decided to represent those numbers as Strings (I've seen that).

Beyond that, every single lookup is costly, you have to hash the key to lookup the value and you have to compare the key.

In a POJO, when you say "foo.bar", it's just an offset in memory that Java ends up doing. It's absurdly faster.

Please, for the love of god, if you know the structure of the data you are working with, turn it into your language's version of a struct. Stop using dictionaries for everything.
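The figures above are about Java. As a loose analogue only, here is the same comparison in CPython between a dict and a fixed-shape class using `__slots__`. `Reading` and its fields are invented for the sketch, and the sizes printed will differ between interpreter versions.

```python
import sys


class Reading:
    """Fixed-shape record; __slots__ removes the per-instance attribute dict."""
    __slots__ = ("sensor_id", "timestamp")

    def __init__(self, sensor_id: int, timestamp: int) -> None:
        self.sensor_id = sensor_id
        self.timestamp = timestamp


as_dict = {"sensor_id": 7, "timestamp": 1_700_000_000}
as_record = Reading(7, 1_700_000_000)

# Shallow container sizes only: the ints and key strings are not counted.
dict_size = sys.getsizeof(as_dict)
record_size = sys.getsizeof(as_record)
print(f"dict: {dict_size} bytes, slotted object: {record_size} bytes")
```

This only measures the containers, so treat it as a direction of travel rather than a benchmark.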

by ronjakoi 5 hours ago

I work with PHP, where classes are supposedly a lot slower than strings and arrays (PHP calls dictionaries "associative arrays").

by cogman10 5 hours ago

Benchmark it, but from what I can find this is dated advice. It might be faster on first load but it'd surprise me if it's always faster.

Edit: looking into how PHP has evolved, PHP 8 added a JIT in 2020. That will almost certainly make it faster to use a class rather than an associative array. Associative arrays are very hard for a JIT to look through and optimize around.

by esafak 5 hours ago

Obviously one where no-one who cared or knew better had any say.

by Thaxll 5 hours ago

Trying to parse email will result in bad assumptions. Better be a plain string than a bad regex.

For example, many websites reject the + character, which is totally valid and which Gmail uses for temporary emails.

Same for addresses.

by jghn 5 hours ago

A lot of posts in this thread are conflating two separate but related topics. Statically typing a string as EmailAddress does not imply validating that the string in question is a valid email address. Both operations have their merits and downsides, but they don't need to be tied together.

Having a type wrapper of EmailAddress around a string with no business logic validation still allows me to take a string I believe to be an email address and be sure that I'm only passing it into function parameters that expect an email address. If I misorder my parameters and accidentally pass it to a parameter expecting a type wrapper of UserName, the compiler will flag it.

by abnercoimbre 5 hours ago

Recently got a bank account which allowed my custom domain during registration, but rejected it as invalid during login. The problem? Their JS client code has a bad regex rejecting TLDs longer than 4 chars (trivial for a dev to bypass, but wow.)

by tracker1 6 hours ago

What's funny is that this is exactly one of the reasons I happen to like JavaScript... at its core, the type coercion and falsy boolean rules work really well (imo) for ETL type work, where you're dealing with potentially untrusted data. How many times have you had to import a CSV with a bad record/row? It seems to happen all the time. Why? Because people use and manually manipulate data in spreadsheets.

In the end, it's a big part of why I tend to reach for JS/TS first (Deno) for most scripts that are even a little complex to attempt in bash.

by rileymichael 6 hours ago

this is likely an ecosystem sort of thing. if your language gives you the tools to do so at no cost (memory/performance) then folks will naturally utilize those features and it will eventually become idiomatic code. kotlin value classes are exactly this and they are everywhere: https://kotlinlang.org/docs/inline-classes.html

by gr4vityWall 5 hours ago

Haxe has a really elegant solution to this in the form of Abstracts[0][1]. I wonder why this particular feature never became popular in other languages, at least to my knowledge.

0 - https://code.haxe.org/category/abstract-types/color.html

1 - https://haxe.org/manual/types-abstract.html

by Boxxed 6 hours ago

Well that's terrifying

by mattmanser 3 hours ago

Clearly never worked in any statically typed language then.

Almost every project I've worked on has had some sort of email object.

Like I can't comprehend how different our programming experiences must be.

Everything is parsed into objects at the API layer, I only deal with strings when they're supposed to be strings.

by eptcyka 6 hours ago

My condolences, I urge you to recover from past trauma and not let it prohibit a happy life.

by krick 4 hours ago

At first I had a negative reaction to that comment and wanted to snap back something along the lines of "that's horrible" as well, but after thinking for a while, I decided that if I have anything to contribute to the discussion, I have to kinda sorta agree with you, and even defend you.

I mean, of course having a string, when you mean "email" or "date" is only slightly better than having a pointer, when you mean a string. And everyone's instinctive reaction to that should be that it's horrible. In practice though, not only did I often treat some complex business-objects and emails as strings, but (hold onto yourselves!) even dates as strings, and am ready to defend that as the correct choice.

Ultimately, it's about how much we are ready to assume about the data. I mean, that's what modelling is: making a set of assumptions about the real world and rejecting everything that doesn't fit our model. Making a neat little model is what every programmer wants. It's the "type-driven design" the OP praises. It's beautiful, and programmers must make beautiful models and write beautiful code, otherwise they are bad programmers.

Except, unfortunately, programming has nothing to do with beauty, it's about making some system that gets some data from here, displays it there and makes it possible for people and robots to act on the given data. A beautiful model is essentially only needed for us to contain the complexity of that system into something we can understand and keep working. The model doesn't truly need to be complete.

Moreover, as everyone with 5+ years of experience must know (I imagine), our models are never complete, it always turns out that assumptions we make are naïve at best. It turns out there was time before 1970, there are leap seconds, time zones, DST, which is up to minutes, not hours, and it doesn't necessarily happen on the same date every year (at least not in terms of Gregorian calendar, it may be bound to Ramadan, for example). There are so many details about the real world that you, brave young 14 (or 40) year old programmer, don't know yet!

So, when you model data "correctly" and turn "2026-02-10 12:00" (or better yet, "10/02/2026 12:00") into a "correct" DateTime object, you are making a hell of a lot of assumptions, and some of them, I assure you, are wrong. Hopefully, it just so happens that it doesn't matter in your case, this is why such modelling works at all.

But what if it does? What if it's the datetime on a ticket that a third party provided to you, and you are providing it to a customer now? And you get sued if it ends up the wrong date because of some transformations that happened inside of your system? Well, it's best if it doesn't happen. Fortunately, no other computations in the system seem to rely on the fact it's a datetime right now, so you can just treat it as a string. Is it UTC? Event city timezone? Vendor HQ city timezone? I don't know! I don't care! That's what was on the ticket, and it's up to you, dear customer, to get it right.

So, ultimately, it's about where you are willing to put the boundary between your model and scary outer world, and, pragmatically, it's often better NOT to do any "type-driven design" unless you need to.

by masklinn 6 hours ago

> it feels like sometime after Java got popular [...] a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.

I think you have a very rose-tinted view of the past: while on the academic side static types were intended for proof, on the industrial side they were for efficiency. C didn't get static types in order to prove your code was correct, and it's really not great at doing that, it got static types so you could account for memory and optimise it.

Java didn't help either, when every type has to be a separate file the cost of individual types is humongous, even more so when every field then needs two methods.

> In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around.

In most strong statically typed languages you would not, but in most statically typed codebases you would. Just look at the Windows interfaces. In fact while Simonyi's original "apps hungarian" had dim echoes of static types, that got completely washed out in "systems hungarian", which was used widely in C++, which is already a statically typed language.

by guerrilla 6 hours ago

> I think you have a very rose-tinted view of the past

I think they also forgot the entire Perl era.

by esafak 5 hours ago

That's understandable. Youthful indiscretion is best forgotten.

by zahlman 1 hour ago

I can still remember trying to deal with structured binary data in Perl, just because I didn't want to fiddle around with memory management in C. I'm not sure it was actually any less painful, and I ultimately abandoned that first attempt.

(Decades later, my "magnum opus" has been through multiple mental redesigns and unsatisfactory partial implementations. This time, for sure...)

by thom 1 hour ago

My experience was that enterprise programmers burned out on things like WSDL at about the same time Rails became usable (or Django if you’re that way inclined). Rails had an excellent story for validating models which formed the basis for everything that followed, even in languages with static types - ASP.NET MVC was an attempt to win Rails programmers back without feeling too enterprisey. So you had these very convenient, very frameworky solutions that maybe looked like you were leaning on the type system but really it was all just reflection. That became the standard in every language, and nobody needed to remember “parse don’t validate” because heavy frameworks did the work. And why not? Very few error or result types in fancy typed languages are actually suited for showing multiple (internationalised) validation errors on a web page.
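For what it's worth, a parser can still report several problems at once if its failure type is built for that. A small Python sketch, where `Signup`, `Invalid`, `parse_signup` and the age limit of 150 are all invented for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Signup:
    name: str
    age: int


@dataclass(frozen=True)
class Invalid:
    # Every problem found, keyed by form field, so a page can show them all.
    errors: Dict[str, List[str]]


def parse_signup(form: Dict[str, str]) -> Union[Signup, Invalid]:
    errors: Dict[str, List[str]] = {}

    name = form.get("name", "").strip()
    if not name:
        errors.setdefault("name", []).append("required")

    age = 0
    try:
        age = int(form.get("age", "").strip())
    except ValueError:
        errors.setdefault("age", []).append("must be a whole number")
    else:
        if not 0 <= age <= 150:
            errors.setdefault("age", []).append("must be between 0 and 150")

    # Keep going after the first failure instead of raising on it.
    return Invalid(errors) if errors else Signup(name, age)
```

The error strings could just as well be message keys looked up per locale; that part is left out here.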

The bitter lesson of programming languages is that whatever clever, fast, safe, low-level features a language has, someone will come along and create a more productive framework in a much worse language.

Note, this framework - perhaps the very last one - is now ‘AI’.

by chriswarbo 5 hours ago

> You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date

It's tricky because `class` conflates a lot of semantically-distinct ideas.

Some people might be making `Date` objects to avoid writing defensive code everywhere (since classes are types), but...

Other people might be making `Date` objects so they can keep all their date-related code in one place (since classes are modules/namespaces, and in Java classes even correspond to files).

Other people might be making `Date` objects so they can override the implementation (since classes are jump tables).

Other people might be making `Date` objects so they can overload a method for different sorts of inputs (since classes are tags).

I think the pragmatics of where code lives, and how the execution branches, probably have a larger impact on such decisions than safety concerns. After all, the most popular way to "avoid writing defensive code everywhere" is to.... write unsafe, brittle code :-(

by munificent 4 hours ago

> You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g.

There's nothing natural about this. It's not like we're born knowing good object-oriented design. It's a pattern that has to be learned, and the linked article is one of the well-known pieces that helped a lot of people understand this idea.

by noelwelsh 6 hours ago

2 out of 3 problematic bugs I've had in the last two years or so were in statically typed languages where previous developers didn't use the type system effectively.

One bug was in a system that had an Email type but didn't actually enforce the invariants of emails. The one that caused the problem was it didn't enforce case insensitive comparisons. Trivial to fix, but it was encased in layers of stuff that made tracking it down difficult.

The other was a home grown ORM that used the same optional / maybe type to represent both "leave this column as the default" and "set this column to null". It should be obvious how this could go wrong. Easy to fix but it fucked up some production data.

Both of these are failures to apply "parse, don't validate". The former didn't enforce the invariants of the type it had supposedly parsed the data into. The latter didn't differentiate two different parse results.
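For the ORM case, one common way to keep the two meanings apart is a dedicated sentinel instead of reusing `None`. This is a generic Python sketch, not the ORM from the comment; `UNSET` and `columns_to_write` are hypothetical names.

```python
from typing import Dict, Optional, Union


class _Unset:
    """Sentinel type meaning 'leave this column at its default'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def columns_to_write(
    nickname: Union[str, None, _Unset] = UNSET,
) -> Dict[str, Optional[str]]:
    columns: Dict[str, Optional[str]] = {}
    if not isinstance(nickname, _Unset):
        # Here None is an explicit request to store NULL.
        columns["nickname"] = nickname
    return columns
```

With three distinct inputs (a value, `None`, `UNSET`) the caller's intent survives all the way to the query builder.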

by rzwitserloot 6 hours ago

that's a bit of a hairy situation. You're doing it wrong. Or not really, but.. complicated.

As per [RFC 5321](https://www.rfc-editor.org/rfc/rfc5321.html):

> the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.

You're not allowed to do that. The email address `foo@bar.com` is identical to `foo@BAR.com`, but not necessarily identical to `FOO@bar.com`. If we're going to talk about 'commonly applied normalisations at most email providers', where do you draw that line? Should `foo+whatever@bar.com` be considered equal to `foo@bar.com`? That sounds weird, except - that is exactly how gmail works, a couple of other mail providers have taken up that particular torch, and if your aim is to uniquely identify a 'recipient', you can hardcode that `a@gmail.com` and `a+whatever@gmail.com` definitely, guaranteed, end up at the same mailbox.

In practice, yes, users _expect_ that email addresses are case insensitive. Not just users, even - various intermediate systems apply the same incorrect logic.

This gets to an intriguing aspect of hardcoding types: you mostly lose the flex. Types are still better. The alternative is that you have to reliably write the same logic (or at least a call to some logic) to disentangle this mess every time you do anything with a string you happen to know is an email address. That is terrible, but it does give you the option of intentionally not doing so when you don't want to apply the usual logic.

That's no way to program, hence actual types and the general trend that comes with them (namely: we do this right, we write it once, and there is no flexibility left). Programming is too hard to leave room for exotic cases that programmers aren't going to think about when dealing with this concept. And if you do need to deal with one, it can still be encoded in the type, but that makes visible things that are invisible in untyped systems: if my email type only has a '.compare(boolean caseSensitive)' style method, and is not itself inherently comparable because of the case sensitivity thing, that makes it _seem_ much more complicated than plain old strings. This is a lie: emails in strings *ARE* complicated. They just are. You can't make that go away. But you can hide it, and shoving all data into overly generic data types (numbers and strings) tends to do that.
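A type along those lines could be sketched in Python as follows. `EmailAddress`, `same_mailbox` and `fold_local_case` are hypothetical names, the parsing is deliberately crude, and internationalised domains are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)  # eq=False: no built-in notion of equality
class EmailAddress:
    local: str
    domain: str

    @classmethod
    def parse(cls, raw: str) -> "EmailAddress":
        # Split on the last '@'; a quoted local part may itself contain '@'.
        local, sep, domain = raw.rpartition("@")
        if not sep or not local or not domain:
            raise ValueError(f"not an email address: {raw!r}")
        return cls(local, domain)

    def same_mailbox(self, other: "EmailAddress", *, fold_local_case: bool) -> bool:
        # Domains always compare case-insensitively; the local part only
        # does when the caller explicitly opts in.
        if self.domain.lower() != other.domain.lower():
            return False
        if fold_local_case:
            return self.local.lower() == other.local.lower()
        return self.local == other.local
```

Forcing the caller to pass `fold_local_case` is the point: the decision is visible at every comparison instead of buried in a string `==`.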

by pja 5 hours ago

These days the world assumes that all parts of emails are case-insensitive, even if RFC5321 says otherwise. If it’s true for Google, Outlook & Apple mail then it’s basically true everywhere & everyone else has to get with the program.

If you don’t want to lose potentially important email then you need to make sure your own systems are case-insensitive everywhere. Otherwise you’ll find out the hard way when a customer or supplier is using a system that capitalises entire email addresses (yes, I have seen this happen) & you lose important messages.

by jiehong 51 minutes ago

And then Clojure enters: let’s keep few data structures but with tons of methods.

So things stay as maps or arrays all the way through.

by bcrosby95 7 hours ago

In my experience that's pretty rare. Most people pass around string phone numbers instead of a phonenumber class.

Java makes it a pain though, so most code ends up primitive obsessed. Other languages make it easier, but unless the language and company have a strong culture around this, they still usually end up primitive obsessed.

by vips7L 7 hours ago

    record PhoneNumber(String value) {}

Huge pain.

by jonathanlydall 4 hours ago

I’m very much a proponent of statically typed languages and primarily work in C#.

We tried “typed” strings like this on a project once for business identifiers.

Overall it worked in making sure that the wrong type of ID couldn’t accidentally be used in the wrong place, but the general consensus after moving on from the project was that the “juice was not worth the squeeze”.

I don’t know if other languages make it easier, but in C# it felt like the language was mostly working against you. For example, data needs to come in and out over an API and is in string form when it does, meaning you have to do manual conversions all the time.

In C# I use named arguments most of the time, making it much harder to accidentally pass the wrong string into a method or constructor’s parameter.

by kleiba 6 hours ago

What have you gained?

by munk-a 6 hours ago

Without any other context? Nothing - it's just a type alias...

But the context this type of an alias should exist in is one where a string isn't turned into a PhoneNumber until you've validated it. All the functions taking a string that might end up being a PhoneNumber need to be highly defensive - but all the functions taking a PhoneNumber can lean on the assumptions that go into that type.

It's nice to have tight control over the string -> PhoneNumber parsing that guarantees all those assumptions are checked. Ideally that'd be done through domain based type restrictions, but it might just be code - either way, if you're diligent, you can stop being defensive in downstream functions.

by seanwilson 6 hours ago

> All the functions taking a string that might end up being a PhoneNumber need to be highly defensive

Yeah, I can't relate at all with not using a type for this after having to write gross defensive code a couple of times e.g. if it's not a phone number you've got to return undefined or throw an exception? The typed approach is shorter, cleaner, self-documenting, reduces bugs and makes refactoring easier.

by thfuran 6 hours ago

>But the context this type of an alias should exist in is one where a string isn't turned into a PhoneNumber until you've validated it.

Even if you don't do any validation as part of the construction (and yeah, having a separate type for validated vs unvalidated is extremely helpful), universally using type aliases like that pretty much entirely prevents the class of bugs from accidentally passing a string/int typed value into a variable of the wrong stringy/inty type, e.g. mixing up different categories of id or name or whatever.

by barmic12 6 hours ago

One issue is that it’s not a type alias but a type encapsulation. This has a cost at runtime; it’s not a zero-cost abstraction as it is in some functional languages.

by vips7L 4 hours ago

Correctness is more important than runtime costs.

by esafak 5 hours ago

Validation, readability, and prevention of accidentally passing in the wrong string (e.g., by misordering two strings arguments in a function).

by jalk 6 hours ago

An explicit type

by dylan604 6 hours ago

Obviously the pseudo code leaves to the imagination, but what benefits does this give you? Are you checking that it is 10-digits? Are you allowing for + symbols for the international codes?

by xboxnolifes 1 hour ago

You have functions

    void callNumber(string phoneNumber);
    void associatePhoneNumber(string phoneNumber, Person person);
    Person lookupPerson(string phoneNumber);
    Provider getProvider(string phoneNumber);

I pass in "555;324+289G". Are you putting validation logic into all of those functions? You could have a validation function you write once and call in all of those functions, but why? Why not just parse the phone number into an already validated type and pass that around?

    PhoneNumber PhoneNumber(string phoneNumber);
    void callNumber(PhoneNumber phoneNumber);
    void associatePhoneNumber(PhoneNumber phoneNumber, Person person);
    Person lookupPerson(PhoneNumber phoneNumber);
    Provider getProvider(PhoneNumber phoneNumber);

Put all of the validation logic into the type conversion function. Now you only need to validate once from string to PhoneNumber, and you can safely assume it's valid everywhere else.
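A runnable version of that idea in Python might look like the sketch below. The normalisation rule (strip spaces, dashes, dots and parentheses, then require 7 to 15 digits with an optional leading '+') is an assumption picked for the example, not a universal definition of a phone number.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class PhoneNumber:
    digits: str  # normalized: optional leading '+', then digits only

    @classmethod
    def parse(cls, raw: str) -> "PhoneNumber":
        # All validation lives here, in the string -> PhoneNumber conversion.
        cleaned = re.sub(r"[\s\-().]", "", raw)
        if not re.fullmatch(r"\+?[0-9]{7,15}", cleaned):
            raise ValueError(f"not a phone number: {raw!r}")
        return cls(cleaned)


def call_number(number: PhoneNumber) -> str:
    # No validation here: the argument was checked when it was parsed.
    return f"dialing {number.digits}"
```

`call_number(PhoneNumber.parse("555;324+289G"))` fails at the `parse` call, before any of the downstream functions run.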

by flqn 6 hours ago

Can't pass a PhoneNumber to a function expecting an EmailAddress, for one, or mix up the order of arguments in a function that may otherwise just take two or more strings

by munk-a 6 hours ago

That's going to be up to the business building the logic. Ideally those assumptions are clearly encoded in an easily readable manner but at the very least they should be captured somewhere code adjacent (even if it's just a comment and the block of logic to enforce those restraints).

by bjghknggkk 6 hours ago

How to make a crap system that users will hate: Let some architecture astronaut decide what characters should be valid or not.

by JambalayaJimbo 6 hours ago

If you are not checking that the phone number is 10 digits (or whatever the rules are for the phone number for your use case), it is absolutely pointless. But why would you not?

by jghn 6 hours ago

I would argue it's the other way around. If I take a string I believe to be a phone number and wrap it in a `PhoneNumber` type, and then later I try to pass it in as the wrong argument to a function, for example if I get the order of name & phone number reversed, it'll complain. Whereas if both name & phone number are strings, it won't complain.

That's what I see as the primary value to this sort of typing. Enforcing the invariants is a separate matter.

by bjghknggkk 6 hours ago

And parentheses. And spaces (that may, or may not, be trimmed). And all kind of unicode equivalent characters, that might have to be canonicalized. Why not treat it as a byte buffer anyway.

by waynesonfire 6 hours ago

What did you lose?

by css_apologist 6 hours ago

This is an idea that is not ON or OFF

You can get ever so gradually stricter with your types, which means that the operations you perform on a narrow type are even more solid

It is also 100% possible to do in dynamic languages, it's a cultural thing

reply

upvote

by Archelaos6 hours ago|

[-]

Strong static type checking is helpful when implementing the methodology described in this article, but it is beside the article's focus. You still need to use the most restrictive type. For example, uint instead of int when you want to exclude negative values; a non-empty list type if your list should not be empty; etc.

When the type is more complex, specific constraints should be used. A real-life example: I designed a type for the occupancy of a room in a hotel booking application. The number of occupants of a room must be positive and a child must be accompanied by at least one adult. My type Occupants has a constructor Occupants(int adults, int children) that verifies that condition on construction (and also some maximum values).
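A rough Python sketch of such a type, not the actual implementation described above. The maximum values are invented for illustration, since the comment does not say what they were:

```python
from dataclasses import dataclass

MAX_ADULTS = 4    # assumed limit, purely illustrative
MAX_CHILDREN = 4  # assumed limit, purely illustrative


@dataclass(frozen=True)
class Occupants:
    adults: int
    children: int

    def __post_init__(self) -> None:
        # Every rule is checked once, at construction time.
        if self.adults < 0 or self.children < 0:
            raise ValueError("counts cannot be negative")
        if self.adults + self.children == 0:
            raise ValueError("a room needs at least one occupant")
        if self.children > 0 and self.adults == 0:
            raise ValueError("a child must be accompanied by at least one adult")
        if self.adults > MAX_ADULTS or self.children > MAX_CHILDREN:
            raise ValueError("too many occupants for one room")
```

Any function that accepts an `Occupants` can then rely on these conditions without repeating the checks.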

reply

upvote

by imtringued5 hours ago|

[-]

Using uint to exclude negative values is one of the most common mistakes, because underflow wrapping is the default instead of saturation. You subtract a big number from a small number and your number suddenly becomes extremely large. This is far worse than e.g. someone having traveled a negative distance.

reply

upvote

by Archelaos4 hours ago|

[-]

In C# I use the 'checked' keyword in this or similar cases, when it might be relevant: c = checked(a - b);

Note that this does not violate the "Parse, Don't Validate" rule. This rule does not prevent you from doing stupid things with a "parsed" type.

In other cases, I use its cousin unchecked on int values, when an overflow is okay, such as in calculating an int hash code.
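The wraparound and the checked alternative can be imitated in Python for illustration. Python integers never overflow, so this sketch assumes a 32-bit unsigned width and simulates it with a mask; the function names are made up:

```python
UINT32_MAX = 2**32 - 1


def wrapping_sub_u32(a: int, b: int) -> int:
    """Subtract the way an unchecked 32-bit unsigned integer would."""
    return (a - b) & UINT32_MAX


def checked_sub_u32(a: int, b: int) -> int:
    """Subtract, but refuse to produce a value outside the uint32 range."""
    result = a - b
    if not 0 <= result <= UINT32_MAX:
        raise OverflowError(f"{a} - {b} does not fit in an unsigned 32-bit integer")
    return result


print(wrapping_sub_u32(3, 5))  # 4294967294
```

The checked variant turns the silent wraparound into an error at the point where it happens, which is roughly what `checked(a - b)` is used for above.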

reply

upvote

by jackpirate6 hours ago|

[-]

> Edit: Changed this from email because email validation is a can of worms as an example

Email honestly seems much more straightforward than dates... Sweden had a Feb 30 in 1712, and there's all sorts of date ranges that never existed in most countries (e.g. the American colonies skipped September 3-13 in 1752).

reply

upvote

by flqn6 hours ago|

[-]

Dates are unfortunate in that you can only really parse them reliably with a TZDB.

reply

upvote

by conartist66 hours ago|

[-]

I think you're quite right that the idea of "parse don't validate" is (or can be) quite closely tied to OO-style programming.

Essentially the article says that each data type should have a single location in code where it is constructed, which is a very class-based way of thinking. If your Java class only has a constructor and getters, then you're already home free.

Also for the method to be efficient you need to be able to know where an object was constructed. Fortunately class instances already track this information.

reply

upvote

by brooke2k6 hours ago|

[-]

this is very much a nitpick, but I wouldn't call throwing an exception in the constructor a good use of static typing. sure, it's using a separate type, but the guarantees are enforced at runtime

reply

upvote

by munificent4 hours ago|

[-]

I wouldn't call it a good use of static typing, but I'd call it a good use of object-oriented programming.

This is one of the really key ideas behind OOP that tends to get overlooked. A constructor's job is to produce a semantically valid instance of a class. You do the validation during construction so that the rest of the codebase can safely assume that if it can get its hands on a Foo, it's a valid Foo.

reply

upvote

by zanecodes5 hours ago|

[-]

Given that the compiler can't enforce that users only enter valid data at compile time, the next best thing is enforcing that when they do enter invalid data, the program won't produce an `Email` object from it, and thus all `Email` objects and their contents can be assumed to be valid.

reply

upvote

by mh22665 hours ago|

[-]

This is all pretty language-specific and I think people may end up talking past each other.

Like, my preferred alternative is not "return an invalid Email object" but "return a sum type representing either an Email or an Error", because I like languages with sum types and pattern matching and all the cultural aspects those tend to imply.

But if you are writing Python or Java, that might look like "throw an exception in the constructor". And that is still better than "return an Email that isn't actually an email".
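For what it's worth, Python can approximate the sum-type version too. A minimal sketch, assuming a deliberately crude email check (nowhere near RFC 5322) and hypothetical `Email`, `ParseError` and `parse_email` names:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Email:
    address: str


@dataclass(frozen=True)
class ParseError:
    reason: str


def parse_email(raw: str) -> Union[Email, ParseError]:
    """Return either an Email or a ParseError, never a half-valid Email."""
    candidate = raw.strip()
    local, sep, domain = candidate.partition("@")
    # Deliberately crude check: an assumption for the sketch only.
    if not sep or not local or "." not in domain:
        return ParseError(f"does not look like an email address: {raw!r}")
    return Email(candidate)


def describe(raw: str) -> str:
    result = parse_email(raw)
    # The caller has to look at which case it got before using the value.
    if isinstance(result, ParseError):
        return f"rejected: {result.reason}"
    return f"accepted: {result.address}"
```

One caveat: nothing here stops someone from calling `Email("garbage")` directly, since Python has no private constructors. Treating `parse_email` as the only entry point is a convention in this sketch.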

reply

upvote

by zanecodes4 hours ago|

[-]

Ah yeah, I guess I assumed by the use of the term "constructor" that GP meant a language like Python or Java, and in some cases it can be difficult to prevent misuse by making an unsafe constructor private and only providing a public safe constructor that returns a sum type.

I definitely agree returning a sum type is ideal.

reply

upvote

by imtringued5 hours ago|

[-]

I agree and for several reasons.

If you have onerous validation on the constructor, you will run into extremely obvious problems during testing. You just want a jungle, but you also need the ape and the banana.

reply

upvote

by mh22664 hours ago|

[-]

What big external dependencies do you need for a parser?

`String -> Result<Email, Error>` shouldn't need any other parameters?

But you should ideally still have some simple field-wise constructor (whatever that means, it's language-dependent) anyways, the function from String would delegate to that after either extracting all of the necessary components or returning/throwing an error.

reply

upvote

by yakshaving_jgt7 hours ago|

[-]

It's a design choice more than anything. Haskell's type safety is opt-in — the programmer has to actually choose to properly leverage the type system and design their program this way.

reply

upvote

by wat100006 hours ago|

[-]

I'm not sure, maybe a little bit. My own journey started with BASIC and then C-like languages in the 80s, dabbling in other languages along the way, doing some Python, and then transitioning to more statically typed modern languages in the past 10 years or so.

C-like languages have this a little bit, in that you'll probably make a struct/class from whatever you're looking at and pass it around rather than a dictionary. But dates are probably just stored as untyped numbers with an implicit meaning, and optionals are a foreign concept (although implicit in pointers).

Now, I know that this stuff has been around for decades, but it wasn't something I'd actually use until relatively recently. I suspect that's true of a lot of other people too. It's not that we forgot why strong static type checking was invented, it's that we never really knew, or just didn't have a language we could work in that had it.

reply

upvote

by macintux7 hours ago|

[-]

A frequent visitor to HN. Tip: if you click on the "past" link under the title (but not the "past" link at the top of the page), you'll trigger a search for previous posts.

https://hn.algolia.com/?query=Parse%2C%20Don%27t%20Validate&...

However, it's more effective to throw quotes into the mix, reduces false positives.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

reply

upvote

by tlavoie19 minutes ago|

[-]

Along with all the general discussion, I found the concept of defensive parsing striking a chord when reading this as well: "The Seven Turrets of Babel: A Taxonomy of LangSec Errors and How to Expunge Them", https://langsec.org/papers/langsec-cwes-secdev2016.pdf

I'd love for these ideas to take hold at work, but I'm on the fringes in infosec, not a dev.

reply

upvote

by zdw7 hours ago|

[-]

This is a great article, but people often trip over the title and draw unusual conclusions.

The point of the article is about locality of validation logic in a system. Parsing in this context can be thought as consolidating the logic that makes all structure and validity determination about incoming data into one place in the program.

This lets you then rely on the fact that you have valid data in a known structure in all other parts of the program, which don't have to be crufted up with validation logic when used.

Related, it's worth looking at tools that further improve structure/validity locality like protovalidate for protobuf, or Schematron for XML, which allow you to outsource the entire validity checking to library code for existing serialization formats.

reply

upvote

by jmholla6 hours ago|

[-]

When I came to this idea on my own, I called it "translation at the edge." But for me it was more than just centralizing data validation, it also was about giving you access to all the tools your programming language has for manipulating data.

My main example was working with a co-worker whose application used a number of timestamps. They were passing them around as strings and parsing and doing math with them at the point of usage. But, by parsing the inputs into the language's timestamp representation, their internal interfaces were much cleaner and their purpose was much more obvious since that math could be exposed at the invocation and not the function logic, and thus necessarily, through complex function names.
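A small Python sketch of the timestamp case, assuming the inputs are ISO 8601 strings with a UTC offset. `parse_timestamp` and `is_expired` are hypothetical names:

```python
from datetime import datetime, timedelta


def parse_timestamp(raw: str) -> datetime:
    """Boundary function: ISO 8601 text in, datetime out (ValueError if malformed)."""
    return datetime.fromisoformat(raw)


def is_expired(issued_at: datetime, now: datetime, ttl: timedelta) -> bool:
    # Interior code works on datetime values only; no string handling here.
    return now - issued_at > ttl


issued = parse_timestamp("2024-01-01T12:00:00+00:00")
now = parse_timestamp("2024-01-01T12:45:00+00:00")
print(is_expired(issued, now, timedelta(minutes=30)))  # True
```

The interior function's signature says what it needs, and the date arithmetic is an ordinary subtraction instead of string handling at each call site.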

reply

upvote

by solomonb6 hours ago|

[-]

I disagree. I think the key insight is to carry the proof with you in the structure of the type you 'parse' into.

reply

upvote

by zdw5 hours ago|

[-]

Could you clarify what you mean by "carry the proof"?

reply

upvote

by Jtsummers5 hours ago|

[-]

Let's say you have the example from the article of wanting a non-empty list, but you don't use the NonEmpty type and instead are just using an ordinary list. As functions get called that require the NonEmpty property, they either have to trust that the data was validated earlier or perform the validation themselves. The data and its type carry no proof that it is, in fact, non-empty.

If you instead parse the data (which includes a validation step) and produce a Maybe NonEmpty, if the result is a Just NonEmpty (vs Nothing) you can pass around the NonEmpty result to all the calls and no more validation ever needs to occur in the code from that point on, and you obviously reject it rather than continue if the result is Nothing. Once you have a NonEmpty result, you have a proof (the type itself) that is carried with it in the rest of the program.
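The same shape can be sketched in Python, with `Optional` standing in for `Maybe`. This is an illustration only; `NonEmpty`, `parse_non_empty` and `first` are hypothetical names, and Python enforces the annotations only if you run a static type checker:

```python
from dataclasses import dataclass
from typing import Generic, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class NonEmpty(Generic[T]):
    """A sequence with at least one element: the head always exists."""
    head: T
    tail: Tuple[T, ...]


def parse_non_empty(items: Sequence[T]) -> Optional[NonEmpty[T]]:
    """Return a NonEmpty if there is at least one item, otherwise None."""
    if not items:
        return None
    return NonEmpty(items[0], tuple(items[1:]))


def first(xs: NonEmpty[T]) -> T:
    # No emptiness check needed: the structure itself has a mandatory head.
    return xs.head
```

The `None` case is handled once, right after parsing. Everything that takes a `NonEmpty` afterwards can skip the check.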

reply

upvote

by hutao50 minutes ago|

[-]

Typed functional programming has the perspective that types are like propositions and their values are proofs of that proposition. For example, the product type A * B encodes logical conjunction, and having a pair with its first element of type A and its second element of type B "proves" the type signature A * B. Similarly, the NonEmpty type encodes the property that at least one element exists. This way, the program is "correct by construction."

This types-are-propositions perspective is called the Curry-Howard correspondence, and it relates to constructive mathematics (wherein all proofs must provide an algorithm for finding a "witness" object satisfying the desired property).

reply

upvote

by solomonb5 hours ago|

[-]

From the article:

    validateNonEmpty :: [a] -> IO ()
    validateNonEmpty (_:_) = pure ()
    validateNonEmpty [] = throwIO $ userError "list cannot be empty"
    
    parseNonEmpty :: [a] -> IO (NonEmpty a)
    parseNonEmpty (x:xs) = pure (x:|xs)
    parseNonEmpty [] = throwIO $ userError "list cannot be empty"

Both consolidate all the invariants about your data; in this example there is only one invariant but I think you can get the point. The key difference between the "validate" and "parse" versions is that the structure of `NonEmpty` carries the proof that the list is not empty. Unlike the ordinary linked list, by definition you cannot have a nil value in a `NonEmpty` and you can know this statically anywhere further down the call stack.

reply

upvote

by munk-a6 hours ago|

[-]

I think that's an excellent way to build a defensive parsing system but... I still want to build that and then put a validator in front of it to run a lot of the common checks and make sure we can populate easy to understand (and voluminous) errors to the user/service/whatever. There is very little as miserable as loading a 20k CSV file into a system and receiving "Invalid value for name on line 3" knowing that there are likely a plethora of other issues that you'll need to discover one by one.
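One possible way to get both, sketched in Python: let the parser itself accumulate every row-level error instead of stopping at the first one. The `name,age` column layout and the `Person` type are assumptions for the example, and the line numbers assume no quoted fields spanning multiple lines:

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Person:
    name: str
    age: int


def parse_people(text: str) -> Tuple[List[Person], List[str]]:
    """Parse every row, returning all valid records plus every error found."""
    people: List[Person] = []
    errors: List[str] = []
    # Data rows start at line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        name = (row.get("name") or "").strip()
        age_text = (row.get("age") or "").strip()
        row_errors = []
        if not name:
            row_errors.append(f"line {line_no}: name is empty")
        if not age_text.isdecimal():
            row_errors.append(f"line {line_no}: age {age_text!r} is not a whole number")
        if row_errors:
            errors.extend(row_errors)
        else:
            people.append(Person(name, int(age_text)))
    return people, errors
```

A caller can refuse the whole file when `errors` is non-empty and still show the user the complete list in one go.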

reply

upvote

by dang5 hours ago|

[-]

Related. Others?

Parse, Don't Validate (2019) - https://news.ycombinator.com/item?id=41031585 - July 2024 (102 comments)

Parse, don't validate (2019) - https://news.ycombinator.com/item?id=35053118 - March 2023 (219 comments)

Parse, Don't Validate (2019) - https://news.ycombinator.com/item?id=27639890 - June 2021 (270 comments)

Parse, Don’t Validate - https://news.ycombinator.com/item?id=21476261 - Nov 2019 (230 comments)

Parse, Don't Validate - https://news.ycombinator.com/item?id=21471753 - Nov 2019 (4 comments)

reply

upvote

by r4victor6 hours ago|

[-]

It seems modern statically-typed and even dynamically-typed languages all adopted this idea, except Go, where they decided zero values represent valid states always (or mostly).

A sincere question to Go programmers – what's your take on "Parse, Don't Validate"?

reply

upvote

by taylorallred6 hours ago|

[-]

Not speaking for all Go programmers, but I think there is a lot of merit in the idea of "making zero a meaningful value". Zero Is Initialization (ZII) is a whole philosophy that uses this idea. Also, "nil-punning" in Clojure is worth looking at. Basically, if you make "zero" a valid state for all types (the number 0, an empty array, a null pointer) then you can avoid wrapping values in Option types and design your code for the case where a block of memory is initialized to zero or zeroed out.

reply

upvote

by masklinn5 hours ago|

[-]

Only if you ignore the billion cases where it doesn't work, such that half the standard library explodes if you try to use it with zero values because they make no sense[0], special mention to reflect.Value's

> Panic: call of reflect.Value.IsZero on zero Value

And the "cool" stuff like database/sql's plethora of Null* for every single type it can support. So you're not really avoiding "wrapping values in Option types", you're instead copy/pasting ad-hoc ones all over, and have to deal with zero values in places where they have no reason to be, forced upon you by the language.

And then of course it looks even worse because... not having universal default values doesn't preclude having opt-in default values. So when that's useful and sensible your type gets a default value, and when it's not it doesn't, and that avoids having to add a check in every single method so your code doesn't go off the rail when it encounters a nonsensical zero value.

[0] or even when you might think it does, like a nil Logger or Handler

reply

upvote

by r4victor5 hours ago|

[-]

That's exactly the problem. Thanks for describing! What I find is people using linters to ensure all struct fields are initialized explicitly (e.g. https://github.com/GaijinEntertainment/go-exhaustruct), which is uhh...

reply

upvote

by mh22665 hours ago|

[-]

I mean, yeah, you can avoid wrapping things that are optional in Option, but that doesn't make the result semantically meaningful. If you have a User struct with "age: 0" does that mean:

1. The user being referred to is a newborn, or:

2. Some other code, locally or on a server you're reading a response from, broke something and forgot to set the age field.

It's not just Rust that gets this (in my opinion) right, Kotlin and Swift also do this well at both a language and serialization library (kotlinx-serialization and Codable) level, but this quote from this article comparing Go and Rust is generally how I think about it:

> If you’re looking to reduce the whole discourse to “X vs Y”, let it be “serde vs crossing your fingers and hoping user input is well-formed”. It is one of the better reductions of the problem: it really is “specifying behavior that should be allowed (and rejecting everything else)” vs “manually checking that everything is fine in a thousand tiny steps”, which inevitably results in missed combinations because the human brain is not designed to hold graphs that big.

https://fasterthanli.me/articles/i-want-off-mr-golangs-wild-...

Basically, if you do not want it to be wrapped in Option then you should either:

1. explicitly specify a default—which can, but doesn't have to be 0.

2. fail immediately (as this article explains) when trying to parse some data that does not have that field set.

and if you do want an Option, you just need to be explicit about that.
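The fail-fast and explicit-Option choices can be sketched in Python, assuming the input is an already-decoded JSON object. `User`, `parse_user_strict` and `parse_user_lenient` are made-up names:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class User:
    name: str
    age: Optional[int]  # None means "not provided", which is distinct from 0


def parse_user_strict(data: Mapping[str, Any]) -> User:
    """Fail immediately if 'age' is absent instead of silently using 0."""
    if "age" not in data:
        raise ValueError("missing required field: age")
    return User(name=str(data["name"]), age=int(data["age"]))


def parse_user_lenient(data: Mapping[str, Any]) -> User:
    """Treat a missing 'age' as an explicit None rather than a zero value."""
    age = data.get("age")
    return User(name=str(data["name"]), age=None if age is None else int(age))
```

Either way, an `age` of 0 in a parsed `User` can only mean the input really said 0.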

reply

upvote

by kubanczyk5 hours ago|

[-]

> what's your take on "Parse, Don't Validate"

Always aspire to that. Translating that to Go conventions, the constructor has to have signature like:

    func NewT() (T, error) {
      ...
    }

Such signatures exist in the stdlib, e.g. https://cs.opensource.google/go/go/+/refs/tags/go1.25.7:src/... although I've met old-hands that were surprised by it.

In larger codebases, I've noticed an emergent phenomenon that usually the T{} itself (bypassing NewT constructor) tends to be unusable anyway, hence the constructor will enforce "parse, don't validate" just well enough. Only very trivial T{} won't have a nilable private field, such as a pointer, func, or chan.

I'd say that "making zero a meaningful value" does not scale well when codebase grows.

reply

upvote

by throw567643u81 hours ago|

[-]

What's lexi up to these days? Her last big contribution to Haskell was the delimited continuation primops, then she disappeared in a puff of smoke.

reply

upvote

by d0liver5 hours ago|

[-]

I think, more generally, "push effects to the edges" which includes validation effects like reporting errors or crashing the program. If you, hypothetically, kept all of your runtime data in a big blob, but validated its structure right when you created it, then you could pass around that blob as an opaque representation. You could then later deserialize that blob and use it and everything would still be fine -- you'd just be carrying around the validation as a precondition rather than explicitly creating another representation for it. You could even use phantom types to carry around some of the semantics of your preconditions.

Point being: I think the rule is slightly more general, although this explanation is probably more intuitive.

reply

upvote

by jmull4 hours ago|

[-]

Systems tend to change over time (and distributed nodes of a system don’t cut over all at once). So what was valid when you serialized it may not be valid when you deserialize it later.

reply

upvote

by d0liver4 hours ago|

[-]

This issue exists with the parsed case, too. If you're using a database to store data, then the lifecycle of that data is in question as soon as it's used outside of a transaction.

We know that external systems provide certain guarantees, and we rely on them and reason about them, but we unfortunately cannot shove all of our reasoning into the type system.

Indeed, under the hood, everything _is_ just a big blob that gets passed around and referenced, and the compiler is also just a system that enforces preconditions about that data.

reply

upvote

by hackrmn2 hours ago|

[-]

This article has done rounds on the Internet before. Maybe because it resonates with people (who repost it time and again). Anyway, I very much agree with the idea. In my experience, "text" or "string" is not a type. Technically it is one, of course, but I seldom see good use of it for when a more apt type would do better -- in short, it's a last resort thing, and it fares badly there too. Ironically, the only good use for it is as input to a... parser.

I see a lot of URLs being passed around as strings within a system perfectly capable of leveraging typing theory and offering user defined types, if not at least through OOP goodness a lot of people would furiously defend. The URL, in this case, would often have _already_ been parsed once, but effectively "unparsed" and keeps being sent around as text in need of parsing at every "junction" of the system that requires to meaningfully access it, except that parsing is approached like some ungodly litany best avoided and thus foregone or lazily implemented with a regex where a regex isn't nearly sufficient. Perhaps it's because we lack parsers, by and large, or in the very least parser generators that are readily available, understandable (to your average developer), and simple enough to use without requiring to understand formal language theory with Chomsky hierarchy, context sensitivity, grammar ambiguity and parse forests, to say the least.

Same with [file] paths, HTTP header values, and other things that seem alluring to dismiss as only being text.

It wouldn't be a problem, had I not seen time and again how the "text" breaks -- URLs with malformed query parameters because why not just do `+ '?' + entries.map(([ name, value ]) => name + "=" + value).join("&")`, how hard can it be? Paths that assume a leading slash, or the lack thereof, etc.
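The same failure is easy to reproduce in Python (used here in place of the JavaScript snippet above). This sketch uses only the standard library's `urllib.parse`; the URL and parameters are invented:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {"q": "a&b=c", "lang": "en"}

# Naive concatenation: the '&' and '=' inside the value corrupt the query string.
naive = "https://example.com/search?" + "&".join(f"{k}={v}" for k, v in params.items())

# urlencode percent-escapes reserved characters in names and values.
safe = "https://example.com/search?" + urlencode(params)

print(parse_qs(urlsplit(naive).query))  # {'q': ['a'], 'b': ['c'], 'lang': ['en']}
print(parse_qs(urlsplit(safe).query))   # {'q': ['a&b=c'], 'lang': ['en']}
```

The naive version silently turns one parameter into two, which is the kind of breakage that only shows up once a value happens to contain a reserved character.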

I believe the article was born precisely of the same class of frustrations. So I am now bringing the same mantra everywhere with me: "There is no such type as string". Parse at earliest opportunity, lazily if the language allows it (most languages do) -- breadth first so as to not pay upfront, just don't let the text slip through.

I am talking from experience, really, your mileage may vary.

reply

upvote

by pcwelder7 hours ago|

[-]

Each repost is worth it.

This, along with John Ousterhout's talk [1] on deep interfaces was transformational for me. And this is coming from a guy who codes in python, so lots of transferable learnings.

[1] https://www.youtube.com/watch?v=bmSAYlu0NcY

reply

upvote

by kayo_202110306 hours ago|

[-]

A great piece.

Unfortunately, it's somewhat of a religious argument about the one true way. I've worked on both sides of the fence, and each field is equally green in its own way. I've used OCaml, with static typing, and Clojure, with maybe-opt-in schema checking. They both work fine for real purposes.

The big problem arrives when you mix metaphors. With typing, you're either in, or you're out - or should be. You ought not to fall between stools. Each point of view works fine, approached in the right way, but don't pretend one thing is the other.

reply

upvote

by rorylaitila5 hours ago|

[-]

I make great use of value objects in my applications but there are things I needed to do to make it ergonomic/performant. A "small" application of mine has over 100 value objects implemented as classes. Large apps easily get into the 1000s of classes just for value objects. That is a lot of boilerplate. It's a lot of boxing/unboxing. It'd be a lot of extra typing than "stringly typed" programs.

To make it viable, all value objects are code-generated from model schemas, and then customized as needed (only like 5% need customization beyond basic data types). I have auto-upcasting on setters so you can code stringly when wanted, but everything is validated (very useful for writing unit tests more quickly). I only parse into types at boundaries or on writes/sets, not on reads/gets (limits the amount of boxing, particularly on reading large amounts of data). Heavy use of reflection, and auto-wiring/dependency injection.

But with these conventions in place, I quite enjoy it. Easy to customize/narrow a type. One convention for all validation. External inputs are by default secure with nice error messages. One place where all value validation happens (./values classes folder).

reply

upvote

by exodys2 hours ago|

[-]

Maybe I am being contrarian, or maybe I don't understand; if I am reading input, I am always going to validate that input after parsing. Especially if it is from a user.

I understand that they should be separate, but they should be very close together.

reply

upvote

by Jtsummers1 hours ago|

[-]

> if I am reading input, I am always going to validate that input after parsing.

In the "parse, don't validate" mindset, your parsing step is validation but it produces something that doesn't require further validation. To stick with the non-empty list example, your parse step would be something like:

  parse (h:t) = Just (h :| t)
  parse []    = Nothing

So when you run this you can assume that the data is valid in the rest of the code (sorry, my Haskell is rusty so this is a sketch, not actual code):

  process data =
    do {
      Just valid <- parse data;
      ... further uses of valid that can assume parsing succeeded, if it didn't an error would already have occurred and you can handle it
    }

That has performed validation, but by parsing it also produces a value that doesn't require any revalidation. Every function that takes the parsed data as an argument can ignore the possibility that the data is invalid. If all you do is validate (returning true/false):

  validate (_:_) = True
  validate []    = False

Then you don't have that same guarantee. You don't know that, in future uses, that the data is actually valid. So your code becomes more complex and error-prone.

  process data =
    if validate data then use data else fail "Well shit"

  use (h:t) = do_something_with h t
  use []    = fail "This shouldn't have happened, we validated it right? Must have been called without data being validated first."

The parse approach adds a guarantee to your code, that when you reach `use` (or whatever other functions) with parsed and validated data that you don't have to test that property again. The validate approach does not provide this guarantee, because you cannot guarantee that `use` is never called without first running the validation. There is no information in the program itself saying that `use` must be called after validation (and that validation must return true). Whereas a version of `use` expecting NonEmpty cannot be called without at least validating that particular property.

reply

upvote

by exodys45 minutes ago|

[-]

Ah, I get it. So, it's just a tagging system. Once tagged, assume valid. DRY.

reply

upvote

by mrkeen48 minutes ago|

[-]

Suppose you're receiving bytes representing a User at the edge of your system. If you put json bytes into your parser and get back a User, then put your User through validation, that means you know there are both 'valid' Users and 'invalid' Users.

Instead, there should simply be no way to construct an invalid User. But this article pushes a little harder than that:

Does your business logic require a User to have exactly one last name, and one-or-more first names? Some people might go as far as having a private-constructor + static-factory-method create(..), which does the validation, e.g.

  class User {
    private List<String> names;
    private User(List<String> names) {..}
    public static User create(List<String> names) throws ValidationException {
       // Check for name rules here
    }
  }

Even though the create(..) method above validates the name rules, you're still left holding a plain old List-of-Strings deeper in the program when it comes time to use them. The name rules were validated and then thrown away! Now do you check them when you go to use them? Maybe?

If you encode your rules into your data-structure, it might look more like:

  class User {
      String lastName;
      NeList<String> firstNames;
      private User(List<String> names) throws ValidationException {..}
  }

If I were doing this for real, I'd probably have some Name rules too (as opposed to a raw String). E.g. only some non-empty collection of utf8 characters which were successfully case-folded or something.

Is this overkill? Do I wind up with too much code by being so pedantic? Well no! If I'm building valid types out of valid types, perhaps the overall validation logic just shrinks. The above class could be demoted to some kind of struct/record, e.g.

  record User(Name lastName, NeList<Name> firstNames);

Before I was validating Names inside User, but now I can validate Names inside Name, which seems like a win:

  class Name {
      private String value;
      private Name (String name) throws ValidationException {..}
  }
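For comparison, a rough Python rendering of the same layering. The "non-blank string" rule for `Name` is an assumption, and "one or more first names" is encoded as one required field plus a tuple of extras rather than a dedicated non-empty list type:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Name:
    value: str

    def __post_init__(self) -> None:
        # Assumed rule for the sketch: a name is any non-blank string.
        if not self.value.strip():
            raise ValueError("name must not be blank")


@dataclass(frozen=True)
class User:
    last_name: Name
    first_name: Name                         # the one required first name
    more_first_names: Tuple[Name, ...] = ()  # zero or more additional ones


def parse_user(raw_names: List[str]) -> User:
    """Read the final item as the last name and everything before it as first names."""
    if len(raw_names) < 2:
        raise ValueError("need at least one first name and a last name")
    *firsts, last = [Name(n) for n in raw_names]
    return User(last_name=last, first_name=firsts[0], more_first_names=tuple(firsts[1:]))
```

`User` itself has no validation code left; it is built entirely out of fields that are already valid.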

reply

upvote

by 1-more4 hours ago|

[-]

A related talk is Richard Feldman's "Making Impossible States Impossible." Richard wrote a number of Elm packages and is the creator of the Roc language.

https://www.youtube.com/watch?v=IcgmSRJHu_8

reply

upvote

by sevensor5 hours ago|

[-]

Making illegal states unrepresentable sounds like a great idea, and it is, but I see it getting applied without nuance. “Has multiple errors” can be a valid type. Instead of bailing immediately, you can collect all of the errors so that they can be reported all together rather than forcing the user to fix one error at a time.

reply

upvote

by mh22665 hours ago|

[-]

Is this not `Result<Whatever, List<Error>>`? There's nothing enforcing that the error side needs to be the value-based equivalent of a single instance of an Exception class.

The important part is not to expose a "String -> Whatever" function publicly.

reply

upvote

by mmis10004 hours ago|

[-]

This article always ends up being relevant once in a while.

Recently, I have been trying to make an LLM output a specific format.

It turns out that no matter how you write the prompt and validate the result, it will never be as effective as just constraining the output with a proper BNF grammar (via a llama.cpp grammar file).

reply

upvote

by cbondurant3 hours ago|

[-]

A really mindset-altering read for me, I've carried this way of thinking ever since I'd first read it a few years ago.

reply

upvote

by Joel_Mckay4 hours ago|

[-]

An unconstrained json/bson parser without recursive structure limits must be bounded somehow. In many cases, the ordering of marshaled data cannot be guaranteed across platforms.

The best method is walk the symbolic tree with a cost function, and score the fitness of the data compared to expected structures. For example, mismatched or duplicate GUID/Account/permission/key fields reroute the message to the dead-letter queue for analysis, missing required fields trigger error messaging, and missing optional fields lower the qualitative score of the message content.

Parsers can be extremely unpredictable, and loosely typed formats are dangerous at times. =3

reply

upvote

by gaigalas5 hours ago|

[-]

> Now I have a single, snappy slogan that encapsulates what type-driven design means to me, and better yet, it’s only three words long

IMHO this is distracting and sort of vain. It forces this "semantics" perspective into the reader, just so the author can have a snappy slogan.

Also, not all languages have such freedom in type expressiveness. Some of them do but offer terrible trade-offs.

The truth is, if you try to be that expressive in a language that doesn't support it you'll end up with a horror story. The article fails to mention that, and that "snappy slogan" makes it look like it's an absolute claim that you must internalize, some sort of deep truth that applies everywhere. It isn't.

reply

upvote

by metalliqaz5 hours ago|

[-]

bonus points for the correct use of "cromulent"

reply

upvote

by yakshaving_jgt7 hours ago|

[-]

I did a lightning talk on this topic last year, with a concrete example in Yesod.

https://www.youtube.com/watch?v=MkPtfPwu3DM

reply

upvote

by curiousgal7 hours ago|

[-]

Semi tangent but I am curious. for those with more experience in python, do you just pass around generic Pandas Dataframes or do you parse each row into an object and write logic that manipulates those instead?

reply

upvote

by tomtom13375 hours ago|

[-]

Definitely do not parse each row into eg pydantic models. You lose the entire performance benefit of pandas / polars by doing this.

If you need it, use a dataframe validation library to ensure that values are within certain ranges.

There are not yet good, fast implementations of proper types in Python dataframes (or databases for that matter) that I am aware of.

by lmeyerov 6 hours ago

Pass as immutable values, and try to enforce schema (eg, arrow) to keep typed & predictable. This is generally easy by ensuring initial data loads get validated, and then basic testing of subsequent operations goes far.

If Python had dependent types, that's how I'd think about them, and keeping them typed would be even easier, e.g. catching nulls that sneak in unexpectedly and break numeric columns.

When using something like dask, which forces stronger adherence to typings, this can get more painful.

by adammarples 6 hours ago

Speaking personally, I try not to write code that passes around dataframes at all. I only really want to interact with them when I have to in order to read/write parquet.

by whalesalad 6 hours ago

The circumstances where you would use one or the other are vastly different. A dataframe is an optimized data structure for dealing with columnar data: filtering, sorting, aggregating, etc. So if that is what you are dealing with, use a dataframe.

The goal is more about cleaning and massaging data at the perimeter (coming in, and going out) versus what specific tool (a collection of objects vs a dataframe) is used.

by whalesalad 5 hours ago

The author's point here is great, but the post does (imho) a poor job illustrating it.

The tl;dr on this is: stop sprinkling guards and if statements all over your codebase. Convert (parse) the data into truthful objects/structs/containers at the perimeter. The goal is to do that work at the boundaries of your system, so that inside of your system you can stop worrying about it and trust the value objects you have.
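As a rough illustration in Rust (every name here is made up for the example, and the 150 upper bound on age is an arbitrary assumption): raw strings are converted once at the edge, and interior functions accept only the converted type.

```rust
/// Raw input as it arrives at the edge (hypothetical shape).
struct RawSignup<'a> {
    name: &'a str,
    age: &'a str,
}

/// Trusted value. In a real codebase the fields would be private to a
/// module so that `parse_signup` is the only way to build one.
#[derive(Debug, PartialEq)]
struct Signup {
    name: String, // trimmed and non-empty
    age: u8,      // at most 150
}

#[derive(Debug, PartialEq)]
enum SignupError {
    EmptyName,
    BadAge,
}

/// The only place where raw text is inspected.
fn parse_signup(raw: &RawSignup<'_>) -> Result<Signup, SignupError> {
    let name = raw.name.trim();
    if name.is_empty() {
        return Err(SignupError::EmptyName);
    }
    let age: u8 = raw.age.trim().parse().map_err(|_| SignupError::BadAge)?;
    if age > 150 {
        return Err(SignupError::BadAge);
    }
    Ok(Signup { name: name.to_string(), age })
}

/// Interior code takes `Signup` and carries no guards of its own.
fn greeting(s: &Signup) -> String {
    format!("Hello {}, age {}", s.name, s.age)
}
```

`greeting` cannot be called with unchecked text at all, which is the part the scattered if statements never gave you.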

I think my hangup here is on the use of terms parse vs validate. They are not the right terms to describe this.

by tialaramex 5 hours ago

I understand where you're coming from, but these terms seem fine to me:

This is exactly what, for example, Rust's str::parse method is for. The documentation gives the example:

    let four: u32 = "4".parse().unwrap();

You will so very often have text and want typed information, and parse is exactly how we do that transformation exactly once. Whereas validation is what it looks like when we try to make piecemeal checks later.

by lock1 3 hours ago

Coming from a more "average imperative" background like C and Java, outside of compiler or serde context, I don't think "parse" is a frequently used term there. The idea of "checking values to see whether they fulfill our expectations or not" is often called "validating" there.

So I believe the "Parse, Don't Validate" catchphrase means nothing, if not confusing, to most developers. "Does it mean this 'parse' operation doesn't 'validate' their input? How do you even perform 'validation' then?" is one of several questions that popped up in my head the first time I read the catchphrase prior to Haskell exposure.

Something like "Utilize your type system" probably makes much more sense for them. Then just show the difference between `ValidatedType validate(RawType)` vs `void RawType::validate() throws ParseError`.
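To make that contrast concrete, here is a small Rust sketch of the two signatures side by side. `Port`, `validate_port`, `parse_port`, and `connect` are hypothetical names invented for the example, and rejecting port 0 is an assumption.

```rust
/// Validation style: the check passes or fails, but the caller is left
/// holding the same untyped `&str` afterwards.
fn validate_port(raw: &str) -> Result<(), String> {
    match raw.parse::<u16>() {
        Ok(p) if p != 0 => Ok(()),
        _ => Err(format!("not a valid port: {}", raw)),
    }
}

/// Parsing style: success hands back a new type that records the check.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Port(u16);

fn parse_port(raw: &str) -> Result<Port, String> {
    match raw.parse::<u16>() {
        Ok(p) if p != 0 => Ok(Port(p)),
        _ => Err(format!("not a valid port: {}", raw)),
    }
}

/// Downstream code can demand the parsed type, so it cannot be handed
/// a string that nobody checked.
fn connect(port: Port) -> String {
    format!("connecting on {}", port.0)
}
```

Both functions run the same check; only the second one lets the compiler remember that it happened.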

by tialaramex 56 minutes ago

The crucial design choice is that you can't get a Doodad by just saying oh, I'm sure this is a Doodad, I will validate later. You have to parse the thing you've got to get a Doodad if that's what you meant, and the parsing can fail because maybe it isn't one.

    let almost_pi: Rational = "22/7".parse().unwrap();

Here the example is my realistic::Rational. The actual Pi isn't a Rational number so we can't represent it, but 22 divided by 7 is a pretty good approximation considering.
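For anyone who hasn't written one, this is roughly what sits behind a `.parse()` call like that. The `Ratio` type below is a hypothetical stand-in written for this comment, not the API of realistic::Rational or any other crate; it only shows the standard `FromStr` trait doing the work.

```rust
use std::str::FromStr;

/// Hypothetical stand-in type; the invariant is that `den` is never zero.
#[derive(Debug, PartialEq)]
struct Ratio {
    num: i64,
    den: i64,
}

#[derive(Debug, PartialEq)]
enum RatioError {
    Malformed,
    ZeroDenominator,
}

impl FromStr for Ratio {
    type Err = RatioError;

    /// The only way from text to `Ratio`; it can fail, and says why.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (n, d) = s.split_once('/').ok_or(RatioError::Malformed)?;
        let num: i64 = n.trim().parse().map_err(|_| RatioError::Malformed)?;
        let den: i64 = d.trim().parse().map_err(|_| RatioError::Malformed)?;
        if den == 0 {
            return Err(RatioError::ZeroDenominator);
        }
        Ok(Ratio { num, den })
    }
}
```

With that in place, `"22/7".parse::<Ratio>()` yields an `Ok` value, while `"22/0"` and `"pi"` yield errors instead of a half-checked value.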

I agree that many languages don't provide a nice API for this, but what I don't see (and maybe you have examples) is languages which do provide a nice API but call it validate. To me that naming would make no sense, but if you've got examples I'll look at them.

by LordDragonfang 5 hours ago

I'll be honest, as someone not familiar with Haskell, one of my main takeaways from this article is going down a rabbit hole of finding out how weird Haskell is.

The casualness with which the author states things like "of course, it's obvious to us that `Int -> Void` is impossible" makes me feel like I'm being xkcd 2501'd.

by mrkeen 4 hours ago

If you spend your life talking about bool having two values, and then need to act as if it has three or 256 values or whatever, that's where the weirdness lives.

In C, true doesn't necessarily equal true.

In Java (myBool != TRUE) does not imply that (myBool == FALSE).

Maybe you could do with some weirdness!

In Haskell: Bool has two members: True & False. (If it's True, it's True. If it's not True, it's False). Unit has one member: (). Void has zero members.
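The same counting can be written out in Rust, if that is more familiar than Haskell. This is only a sketch: `Void`, `absurd`, and the helper functions are names chosen for the example, while `std::convert::Infallible` is the standard library's own empty type.

```rust
use std::convert::Infallible;

/// An enum with no variants has no values, like Haskell's `Void`.
enum Void {}

/// No `Void` value can exist, so this can never actually be called;
/// an empty match therefore covers every case.
fn absurd<T>(v: Void) -> T {
    match v {}
}

/// `()` has exactly one value; `bool` has exactly two.
fn all_units() -> [(); 1] {
    [()]
}

fn all_bools() -> [bool; 2] {
    [false, true]
}

/// With an empty error type, a `Result` can only ever be `Ok`.
fn always_ok(n: u32) -> Result<u32, Infallible> {
    Ok(n)
}

/// The `Err` arm is still written, but it has nothing to match on.
fn unwrap_infallible(r: Result<u32, Infallible>) -> u32 {
    match r {
        Ok(n) => n,
        Err(never) => match never {},
    }
}
```

So an empty type does have a practical use: it lets a signature say "this case cannot happen" in a way the compiler checks.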

To be fair I'm not sure why Void was raised as an example in the article, and I've never used it. I didn't turn up any useful-looking implementations on hoogle[1] either.

[1] https://hoogle.haskell.org/?hoogle=a+-%3E+Void&scope=set%3As...

by tialaramex 41 minutes ago

What were you expecting to find? A function which returns an empty type will always diverge, i.e. there is no return of control, because that return would have a value that we've said never exists. In a systems language like Rust there are functions like this; for example, std::process::exit is a function which... well, hopefully it's obvious why that doesn't return. You could imagine that, likewise, if one day the Linux kernel's reboot routine were written in Rust, that too would never return.
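A small sketch of what a diverging function looks like in Rust. std::process::exit is declared as `pub fn exit(code: i32) -> !`; the two functions below are hypothetical and use a panic instead of an exit so they are easy to try out.

```rust
/// `!` is the "never" type: a function returning it does not hand
/// control back to its caller.
fn give_up(reason: &str) -> ! {
    panic!("giving up: {}", reason)
}

/// A diverging call can sit in a match arm that is expected to produce
/// a `u32`, because that arm never actually produces anything.
fn parse_or_give_up(s: &str) -> u32 {
    match s.parse::<u32>() {
        Ok(n) => n,
        Err(_) => give_up("not a number"),
    }
}
```

`parse_or_give_up("4")` returns normally, while `parse_or_give_up("four")` unwinds with a panic and never reaches a return.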

by danieltanfh95 6 hours ago

Hot take: Static typing is often touted as the end all be all, and all you need to do is "parse, don't validate" at the edge of your program and everything is fine and dandy.

In practice, I find that staunch static typing proponents are often middle or junior engineers that want to work with an idealised version of programming in their heads. In reality what you are looking for is "openness" and "consistency", because no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.

This is also why in practice a lot of customer input ends up being passed as "strings" or kept as a raw copy + parsed copy, because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program.

by steve_adams_86 6 hours ago

> no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.

That's not a fault of type systems, though.

> because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program

That's a problem with overly tight coupling, poor design, and poor planning, not type systems.

> In practice, I find that staunch static typing proponents are often middle or junior engineers

I find people become enthusiastic about it around intermediate stages in their career, and they sometimes embrace it in ways that can be a bit rigid and over-zealous, but again it isn't a problem with type systems.

by mh2266 5 hours ago

How does this square with very senior people putting in a lot of effort to bolt fairly good type systems onto Python and JavaScript?

> business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program.

I just don't understand how this is the case. Fields or methods or whatever are either there, or they are not. Type systems just expose that information. If you need to change the types later on, then change the types.

Example: Kotlin allows you to say "this field will never be null", and it also allows you to say "this field will either be null or not null". Java only allows the latter. If you want the latter in Kotlin, you can still just do that, and now you're able to communicate that (or the other option) to all of your callers.

Typed Python allows you to say "yeah this function returns Any, good luck!" and at least your callers know that. It also allows you to say "this function always returns a str".
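The same distinction, sketched in Rust rather than Kotlin or Python (a swapped-in language, so take it as an analogy): a plain `String` field is always present, and an `Option<String>` field may be absent. `User` and `display_name` are hypothetical names for the example.

```rust
/// Hypothetical record: `email` is always present, `nickname` may not be.
struct User {
    email: String,
    nickname: Option<String>,
}

/// Callers can see from the types alone which field needs a fallback.
fn display_name(u: &User) -> &str {
    // The absent case has to be handled before a `&str` comes out.
    u.nickname.as_deref().unwrap_or(u.email.as_str())
}
```

If the business rule changes and `email` becomes optional too, changing its type to `Option<String>` makes the compiler point at every caller that needs updating.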

by jghn 6 hours ago

> I find that staunch static typing proponents are often middle or junior engineers

I wouldn't go this far as it depends on when the individual is at that phase of their career. The software world bounces between hype cycles for rigorous static typing and full on dynamic typing. Both options are painful.

I think what's more often the case is that engineers start off by experiencing one of these poles and then after getting burned by it they run to the other pole and become zealous. But at some point most engineers will come to realize that both options have their flaws and find their way to some middle ground between the two, and start to tune out the hype cycles.

by solomonb 6 hours ago

This is such a tired take. The burden of using static types is incredibly minimal and makes it drastically simpler to redesign your program around changing business requirements while maintaining confidence in program behavior.

by beastman82 5 hours ago

> staunch static typing proponents are often middle or junior engineers

While we're sharing anecdotal data, I've experienced the opposite.

The older, more experienced fellows love static types and the new ones barely understand what they're missing in JavaScript and Python.

by yakshaving_jgt 4 hours ago

You are yet another person who is misguided in exactly the way described in this article by the same author: https://lexi-lambda.github.io/blog/2020/01/19/no-dynamic-typ...

by waffletower 5 hours ago

I'm sorry, I don't like to title drop, but I am a Staff Data Engineer and I find that "type driven" development is an inappropriate world view for many programming contexts that I encounter. I use "world view" carefully as it makes a contractual assumption about reality -- "give me what I expect". Data processing does not always have the luxury of such imposition. In these contexts a dynamic and introspective world view is more appropriate, "What do we have here?" "What can we use?". In 2019 I would have felt crippled by use of Haskell in data processing contexts and have instead done much in Clojure in these intervening years, though now LLM assisted use of Haskell toward such tasks would be a fun spectator sport.

by kstrauser 1 hour ago

To amplify what yakshaving said, this may be the worst forum in the entire industry to title drop in. Half the people in any given article's comments are a CxO or Chief or Head or Director or Founder or whatever, or wrote the article, or invented the technology in the article, or are otherwise renowned for something or another.

See also: "Did you win the Putnam?"

by yakshaving_jgt 4 hours ago

> I don't like to title drop, but I am a Staff Data Engineer

I am a Chief Technology Officer[^1].

Your opinion here is common, and misguided.

Here is why: https://lexi-lambda.github.io/blog/2020/01/19/no-dynamic-typ...

---

[^1]: Literally nobody cares.

by waffletower 3 hours ago

That's an insular opinion piece that doesn't sway me, and in the age of AI agents it has not aged well. Its shallow rejection of Rich Hickey's nuance is also unconvincing. It is a polemical justification for a coding philosophy that is incomplete and dishonest about the benefits of alternatives. Thanks for reminding me that no one cares; important to reinforce that.
