I dare say you will be hard pressed to find a dataset of significant size that doesn't have at least one invalid entry somewhere. Increasingly strict type rules will not fix that.
> I dare say you will be hard pressed to find a dataset of significant size that doesn't have at least one invalid entry somewhere
I agree. In my experience, and you've forgotten more than I have learned, the mark of a good data engineer is how they account for invalid entries, or whether those entries simply get `/dev/null`ed.
> Checking the datatype is not the same as validating. There is lots of data out there that is invalid, and yet still has the correct type. In fact, that is the common case. Increasingly strict type rules will not fix that.
I am having a hard time letting go of this opportunity to learn from you, so in case you have time and read this again - When you say "There is lots of data out there that is invalid, and yet still has the correct type", I read "type" as "shape" or "memory layout" and "invalid" as "semantically wrong".
So, is a good example of this a value of `-1` for a person's age? The database sees a perfectly valid integer (the correct shape), but the business logic knows a person cannot be negative one years old (semantically invalid).
In that case, to be explicit, a value of `0` or `14` has a valid type for an age (usually integer), but is completely invalid data if it's sitting in an invoicing application for an adult-only business?
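To make my reading concrete, here is a minimal sketch of the distinction as I understand it (the `Customer` class, `validate_age`, and the `minimum=18` cutoff are all hypothetical names I made up for illustration): the storage layer accepts any integer, so the semantic rules have to live in business logic.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    age: int  # type-valid: any int fits the "shape" the database expects

def validate_age(age: int, minimum: int = 18) -> None:
    # -1 and 14 are perfectly valid ints, yet invalid data here
    if age < 0:
        raise ValueError(f"age {age} is semantically impossible")
    if age < minimum:
        raise ValueError(f"age {age} violates the adult-only rule")

for candidate in (-1, 14, 42):
    try:
        validate_age(candidate)
        print(f"{candidate}: ok")
    except ValueError as err:
        print(f"{candidate}: rejected ({err})")
```

No amount of tightening the column type from `int` to a stricter `int` catches `-1` or `14`; only a rule that knows what an "age" means in this application can.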
Again, thank you for your time and attention, these interactions are very valuable.
PS: I'm reminded of a friend complaining that their perfectly valid email address kept getting rejected by a certain bank. The likely culprit was an incomplete regex the bank used for email validation, which refused an address that was entirely legitimate.