Code that is coincidentally similar very often diverges in either the short or long term, and DRYing it up aggressively tends to result in functions that have many boolean parameters that each trigger disjoint sets of behavior - which is a bit of a nightmare to maintain due to the high cognitive overhead of remembering how all the interleaved-but-actually-unrelated behaviors should work.
This outcome is low-cohesion code.
It's a useful concept to be aware of - worth clicking through to the actual content of the talk rather than just the headline.
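For anyone who hasn't run into it, here is a minimal sketch of the boolean-parameter outcome described above. Everything in it (the function names, the flags, the 20-character limit) is invented for illustration and is not taken from the talk.

```python
# Hypothetical example: all names and rules below are made up for illustration.

def format_name(first, last, *, for_invoice=False, for_email=False, legacy=False):
    """The aggressively DRY version: one function, three flags, disjoint behaviours."""
    if for_invoice:
        name = f"{last.upper()}, {first}"
        if legacy:
            name = name[:20]  # 'legacy' only has an effect on the invoice path
        return name
    if for_email:
        return f"{first} {last}".strip()  # 'legacy' is silently ignored here
    return f"{first} {last}"


# The "duplicated" version: small functions, each cohesive on its own.
def invoice_name(first, last):
    return f"{last.upper()}, {first}"


def legacy_invoice_name(first, last):
    return invoice_name(first, last)[:20]


def email_display_name(first, last):
    return f"{first} {last}".strip()
```

The separate functions repeat a little formatting, but a reader never has to work out which flag combinations are meaningful.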
I've seen this article and AFAIR the video before, and FWIW having been a Rails developer from the very early days and fitfully until maybe even 2014, I now interpret the phrase "my Railsconf talk…" quite negatively.
ETA: nice to be back to disagreeing with people on HN about coding principles again though. Hopefully this is a sign.
It would be iconoclastic if the common-sense basic approach were to start with abstraction. It's not: the common-sense default is to write possibly duplicate behavior until you actually discover several cases to abstract away, until you develop a sensible idea of which functionality unites them and which doesn't carry over to all of them.
>Once you have an awkward number of customers (more than five and less than a hundred), maintaining duplicated code that should have been abstracted and modularised will only seem cheap if you don't mind that you burn through even junior employees at a pace
Maintaining the wrong abstraction, or, god help, abstractions, would be even worse.
Hard disagree. When you've had to chase through a change in untold and actually unknown numbers of duplications of code in different permutations and fix them because they are all on fire simultaneously, you'd disagree too. A bad abstraction would at least have had one fire in one place.
The other end of this spectrum is dealing with the architecture astronaut's up-front abstraction. Totally overengineered for solving the initial requirements, but then constantly needing new hacks to make it cope with new requirements as they come up in the normal course of work.
That's why there's a balance in there, it's somewhere between "always duplicate code even when you know a lot about the problem" and "always write abstractions even when you know very little about the problem."
Wouldn't most large codebases with poor abstractions just have engineers engineer around them with their own solutions? In a large enough codebase you'd have both the bad abstractions and all the not-quite-duplicate implementations ignoring the bad abstraction?
I'm using bad here loosely, it could be buggy, incorrect, incomplete, insufficient and more; while being owned by someone or some team that's a challenge to work with for various reasons (overloaded, under-resourced, overbearing, etc., etc.).
Obviously, yes. But it is my experience that this happens more slowly and that API invocations that break when the abstraction is changed are much easier to identify than broader duplicated patterns of code that span many lines and subtly diverge.
And even then those divergences are better because each wrapper around the abstraction is documenting the problem with it. But the abstraction can generally be replaced by one with the same API surface.
(Even if you take into account the fact that any API behaviour ultimately gets relied upon even if undocumented. Which is true.)
To be fair my experience is that of a freelancer and contractor who arrives trying to fix things that have been through many such hands. And I think if these developers had it drummed into their head that any attempt at abstraction would be better than copy and paste, these situations would be more knowable.
When that happens there's a major engineering leadership failure currently in progress, even if engineering leadership isn't aware of it.
EDIT: LLM or not, this is still true. If you have LLMs pumping out tons of duplicate code you're wasting tokens, and probably more importantly wasting engineer hours reviewing duplicate code.
In some cases it might be a fair trade, in moderation. In general it's certainly wrong.
That's true only for "good" abstractions. Bad abstractions will often require you to change code in all the places using them, requiring you to understand how all of them work and what their requirements are, _all at the same time_.
A uses the abstraction, but finds the API doesn't work. Fixes that.
That causes B to have to make a tracking change which induces a bug. B realizes that the API isn't quite right. Fixes it.
That causes A and C to make tracking changes. These induce more bugs. C fixes the abstraction to avoid these cases.
This breaks A and B so they decline to update.
And so on. This is what a bad abstraction looks like. API "fixes" bouncing around the code as they reflect off of the bad abstraction.
The security bugs were all in features I never wanted.
A bit of simple duplication would have been golden.
On the contrary: that's precisely what a bad abstraction would not offer.
Instead it would spread its assumptions to different parts of the system, as every caller, sub-service, etc. would have to change shape to fit in that abstraction's box, however unnatural it is (and we know it would be unnatural, because we already said it's a bad abstraction).
Abstraction is not the same as encapsulation.
But so does duplication, in practice, and it diverges more as it does.
But any abstraction ends up with a signature and a name that can quickly be found in code.
The risk of a long-lived duplication losing its shape and being hard to find is much greater. Especially if the code is going through multiple hands.
I once had to pick up a project — a working, fully functional website. I could see, pretty clearly, the work of several people. All but one of them terrible.
The one was a diligent developer who was fully wrong in their abstraction (in fact significantly) but was consistent in how they used it.
The rest had simply worked around that code, copied and re-copied their own modified duplications and let things lose any shape. The result was error-prone stuff.
Clearly either the budget (or the client's capriciousness — a separate issue and arguably the bigger one) scared away the one guy, who I actually wanted to talk to but could not track down. He possibly had the origin story, and I wanted to know why his particular abstraction, which was at odds with the framework, was there. It was good code in the wrong shape, and it clearly used to do more, and that is interesting.
All the expedient people who had decided to avoid his code and just patch in duplicated pieces around it were the problem. There was no form to their solution at all. And that had clearly happened over some time (because you could see several different code styles)
Abstractions are a form of coupling, and coupling can be good, if the components are truly interdependent, and have a well defined domain. The problem with most abstractions, and I’ve seen this time and time again, is that they become brittle, are overused, and the cost of maintaining them grows exponentially with the size of the code base. The cost balloons because the system has disparate components that look interrelated but are absolutely not. Once you give someone a hammer they tend to assume everything is a nail.
The biggest problem, IMHO, is that abstractions are often used where a pattern would be more effective, easier to maintain, and easier to iterate on. And the primary difference between a pattern and an abstraction really comes down to coupling. Patterns remain decoupled, abstractions are tightly coupled.
And to be clear, I will and do use abstractions, when and where they make sense. But only after clear patterns emerge, and it’s been proven that components are truly coupled.
I will gladly die on the hill that abstractions are measurably worse than duplication an overwhelming amount of the time. They’re often nothing more than a form of premature optimization.
It all depends on the amount of duplication and the complexity of the abstraction. Like you said, no generic advice is possible that clearly separates it into "abstract here" and "duplicate here".
In your example it sounds like we aren't talking about 2-3 places where duplicate code existed that just needed to be refactored into separate units. It sounds more like a complete disregard for abstraction to move on quickly.
If you see duplicate code and have a good understanding of how to solve that then it's totally a good thing. The real problem comes in if you add abstractions without knowing whether they will hold up. And this is where the blog post comes in. In my opinion 2 duplicates are fine, at 3 you should start thinking about or implementing an abstraction if you have a good understanding of the code and use cases.
Exactly. The abstraction purists are not working in the messy, deadline-driven real world.
Write everything twice quickly becomes write everything 4 times once a new change appears, just as quickly as it becomes write everything 8 times, and so on.
I'm afraid there's no sensible soundbite developers can follow blindly.
That's a good problem to have. Getting to 4 or 8 or 12, and then pruning it to 1 or maybe 2 or 3 clearly different cases, is better than shoehorning multiple cases into the wrong abstraction, having everything that speaks with them coupled to that and dancing around their assumptions, and then having to untangle that.
Duplicated code is by definition LESS coupled.
Having a lot of if/else in your code is definitely a cost. My weakness isn’t so much the libraries and APIs, but the actual binary - once I have a service that does A very well, and I run into needing A’, I mostly just add in a config line “op_mode = A|A’” and have the else/if chains in the server driving code. More so for CLIs that I use myself than production services, but I have added tunables for consistency and replication to datastores to allow new use cases and expand my footprint in the data center.
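A rough sketch of what that config-driven branching tends to look like. The `OpMode` enum, `handle_request` and `load_mode` are hypothetical names, and the string transformations are placeholders for whatever "A" and "A'" really do.

```python
from enum import Enum


class OpMode(Enum):
    # Stand-ins for the "A|A'" modes; the real behaviours are not specified.
    A = "A"
    A_PRIME = "A'"


def load_mode(config: dict) -> OpMode:
    # Assumes a config entry like: op_mode = A  (defaults to A when absent).
    return OpMode(config.get("op_mode", "A"))


def handle_request(payload: str, op_mode: OpMode) -> str:
    # Every new mode adds another branch to each place that drives the server.
    if op_mode is OpMode.A:
        return payload.upper()
    elif op_mode is OpMode.A_PRIME:
        return payload.upper()[::-1]
    else:
        raise ValueError(f"unknown op_mode: {op_mode!r}")
```

One tunable is cheap. The cost shows up when several tunables interact and each code path has to be checked against every combination.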
If you haven't figured out a good abstraction at 5-100 customers, God help you.
Half of your abstractions are wrong. The hard part is knowing which half.
This is tautological though, it's like saying “starving is much better than eating the wrong food” (for instance: eating quick lime).
Of course you'll always find a way to do things wrong in a way that is costlier than not doing anything.
But also it's very possible to not realise you needed an abstraction until it catches fire in multiple places.
And quite often it's not you that got the codebase to a hundred customers, is it? Sometimes it is a sequence of fresh-faced young developers who didn't have the authority to say "this duplication is bullshit" and were instead compelled to repeat it.
I think a lot of these discussions happen in nice little blog-post vacuums of progressive thinking, where people can go "mmm, object oriented coding obscures intent and clarity, mmm", blog posts with "an X is a Y", "the unreasonable effectiveness of foobar" etc.
In the real world, every duplication that works sticks for good; there is rarely budget to electively replace code that isn't broken. Until one day it doesn't work. And then… how many times is it actually duplicated? How many of the duplicates diverged? How many of these do we no longer need?
So... the wrong abstraction, no matter how bad, is better than code duplication?
> I would go as far as to say that any abstraction you can maintain (that is in active maintenance, I mean) is better than code duplication once you are past a de minimis threshold.
I appear to be in a solid minority thinking this. But I'm OK with it. I'm probably not going to write a blog post.
This blend of opinion is very naive. Every single project is a business requirement away from having the wrong abstraction in place.
Of course it's a truism if you just say any abstraction that works is a good abstraction.
That is not what I am saying at all. Bullshit abstractions at least let you control the problem. Duplication doesn't.
I agree with you that it’s a truism, but it’s useful advice for people who have a habit of trying too hard to DRY their code. IIRC the author comes from the Ruby world, where DRY was a big thing, and this talk was part of the pendulum swinging back away from this DRY obsession that sometimes just resulted in convoluted code.
I agree that LLMs are naturally anti-abstraction machines. I'm often trying to find ways to reverse that.
I am a bit of an LLM cynic but I am trying to learn it all, and I have to say I have spent most time trying to work out: how do you explain how a brown-field codebase actually works, in such a way that the LLM won't pervert it through misunderstanding.
It does encourage you towards the "conventional" coding standard for any new project, because you want to use a pattern that it will have seen in its training set.
But for example there are differences of opinion in how WordPress plugins (which have a very complex control flow) should be structured. LLMs are incredible at knowing how WP works, actually, but what is difficult is explaining how your methodology for a large plugin is going to work.
It is a battle — but a useful one because it can be used for, er, studying the comparative belief systems of the LLMs.
But if I tell it "read these files that use the same conventions" first, there's no misunderstanding, and the agent also picks up the general "tone" of the code. I have very little to tweak if I've defined the problem well.
Oh that is a bloomin' great idea, and I can fully see how it might work better.
Can't tell you how valuable this comment has been to me and now I feel so much better about evidently kicking a hornet's nest ;-) Thank you so much.
A story I like is that in the now lost era of handwriting recognition on PDAs, Jef Raskin concluded that the easiest way to solve the problem was to change handwriting so as to meet the algorithm in the middle.
That is, to find a noticeable simplification of handwriting that people could learn quickly and that eliminated hard-to-process quirks.
I feel I am there with the LLM at the moment, trying to work out what the common ground is.
It really depends on the exact type of code we're working with, and what our objectives are.
In my case, I often use object inheritance. It's a damn cheap way to DRY. However, when people hear "inheritance," they often think "polymorphism." There's a really big difference between the two, but popular culture has jammed them into one ball, and it's not worth the agita, to try to explain the difference.
But if you are doing optimization, long stacks can be your enemy, and inheritance tends to have long, windy stacks.
In these cases, the copy/pasta method may well be the best approach.
Like I said, "It Depends."
I agree that we should think of inheritance and polymorphism separately. If we want to express this intent in object-oriented code, how can we use inheritance to deduplicate code, while preventing misuse of the resulting object hierarchy i.e. the use of base classes in a polymorphic context? In C++, IIRC private inheritance would do the trick (you cannot static_cast DerivedWidget * to BaseWidget * if DerivedWidget : private BaseWidget), but most OO languages don't support private inheritance.
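In languages without private inheritance, one way to get code reuse without the subtype relationship is composition. This is a different technique from the C++ one mentioned above, not an equivalent of it. A minimal Python sketch, where `WidgetCore` and `DerivedWidget` are hypothetical names:

```python
class WidgetCore:
    """Shared implementation that would otherwise be a base class."""

    def __init__(self, label: str):
        self.label = label

    def render(self) -> str:
        return f"[{self.label}]"


class DerivedWidget:
    """Reuses WidgetCore's code but is deliberately not a WidgetCore subtype."""

    def __init__(self, label: str):
        self._core = WidgetCore(label)  # has-a, not is-a

    def render(self) -> str:
        # Delegate to the shared code, then add the widget-specific part.
        return self._core.render() + "!"
```

Because `DerivedWidget` does not inherit from `WidgetCore`, an `isinstance` check against `WidgetCore` fails, which discourages treating it as a polymorphic stand-in for the shared class.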
Ideal case: support libraries, and then very simple duplicated code that is easy to read and modify. Critically, the core control flow should remain duplicated, but simplified by the support libraries.
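A small sketch of that split, assuming an invented pricing example: the helpers carry the fiddly details, while each caller keeps its own plainly visible loop. All function names and the discount rule are hypothetical.

```python
# --- support library: small shared helpers with no control flow of their own ---
def parse_amount(text: str) -> int:
    """Parse a non-negative decimal string such as '12.50' into cents."""
    units, _, cents = text.partition(".")
    return int(units) * 100 + int((cents + "00")[:2])


def format_amount(cents: int) -> str:
    return f"{cents // 100}.{cents % 100:02d}"


# --- callers: the core control flow is deliberately duplicated ---
def retail_total(lines: list[str]) -> str:
    total = 0
    for line in lines:
        total += parse_amount(line)
    return format_amount(total)


def wholesale_total(lines: list[str]) -> str:
    total = 0
    for line in lines:
        total += parse_amount(line)
    if total > 10_000:
        total -= total // 10  # made-up rule: 10% off above 100.00
    return format_amount(total)
```

The two loops look alike today, but either can change without the other caller having to grow a flag.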
Everyone always thinks duplication is fine when you can bill the modifications by the hour. But they never think to understand that the reason they've had so many employees is that they've turned their change process into firefighting all the different versions of the same code and all these young developers burn out from the sheer anxiety of not knowing where all the little fires are.
I once had to rescue a site that had become a victim of its own popularity, that was written by subcontractors who clearly believed that duplication is better than the wrong abstraction.
Until one day, along came a change — MySQL 4 to MySQL 5 — and a significant duplicated query no longer worked due to its new, proper strictness.
The problem was compounded; not only was the broken pattern in hundreds of places where it had sat, stable and predictable, but the pattern was broken because it, itself, was avoidance of another abstraction that would solve it.
They quit: they said they couldn't and wouldn't fix it. It had always worked how they had done it, and it would have to stay on MySQL 4 (which the hosting provider refused to accommodate).
I don't think it helped that they were severely misguided in their understanding of SQL, but the code had become beholden to duplication and then crippled by a new problem in the duplicated pattern.
I had to first find all the contexts in which that pattern appeared (which required me to spend half a day on a bespoke script) and then work out a new pattern and as few variations of it as possible to fix the duplicated code in each place, because there was no proper budget to rewrite the whole thing. And then I sat at my desk, for days, working through each one, figuring out how to change it to fit the slightly different expression of the pattern.
Even a total bullshit abstraction would have saved that client both time and money. And this is only one of dozens of times I've seen small firms simply duplicate and change code that would later become unmaintainable because of a straw breaking a camel's back.
I would be curious if the previous coders you're talking about actually cited duplication as a good thing. You seem to be implying they are. But almost every instance I've seen of massive code duplication was just from bad programmers shooting from the hip, not from some ideological stance.
Right. But this is a hypothetical, in-a-vacuum situation.
In the real world, your two, three duplicates are in production.
"We really should now de-duplicate this"
"There is not the time or budget, just copy it again; we'll replace all this one day".
Pretty much everyone arguing for duplication has argued what you are saying, which is wait to see a few instances of it before committing to an abstraction. No one is saying duplicate everything 100 times. So I don't think this discussion was ever iconoclastic.
In the real world, duplication happens in an emergent way, there isn't the time each time to judge whether it's really time to just quietly abstract that code, you may not get the permission, budget or window to do it, and if you don't stop the rot really early you are locked into the pattern.
Starting with abstraction when you are only beginning something rarely works well and leads to code bases littered with interfaces having only one implementation.
Abstracting the code when you have two copies does not always pay off, especially when you end up not needing more than just two copies anyway.
But once you have three copies, it's indeed time to start generalizing.
The context in which a decision is evaluated is particularly important for "rules of thumb" like this. There's the rule of 3 (which many senior engineers imparted to me earlier on in my career) - don't refactor until you've actually duplicated it thrice, but even so, what they speak of is a catch-22 that's pretty important to reason about carefully.
On one hand, if you overcorrected on the fear of abstraction, you could easily end up with 500 duplicates that are slightly different and need to be maintained 500 different ways, slowly causing slightly wrong behavior some of the time, data corruption, combinatoric explosion. Surely, once there is such a situation, some degree of abstraction is the only right decision.
On the other hand, if you overcorrected on the fear of duplication early on, you could easily end up with a premature optimization and complexity -- complexity which, most importantly, could be rooted in a gap of understanding of how the code will be used and what direction it may go in over time (often based on which direction the business will go over time).
The only answer that actually works, of course, is "somewhere in the middle." Obviously, that's pretty vague and not very useful. Where, exactly, in the middle IS the right place?
As the years have gone by, I've become more and more steadfast that the answer to that question is and must be an art and not a science. Of course, it must always be rooted in practicality, the actual context of the code around it and where the code/business was in the past and where it will be in the future.
But just as importantly, some of it must be based around beliefs in the face of imperfect information about what you want to invest in for the sake of the technology, the team that develops it, and the business that relies on it. It could be that for your team, your values make it make sense to go a little bit further than "good enough" on normalizing your data modeling, because the way you like to run your business requires that normal form to do the analytics and make decisions productively. It could be that for your team, your values make it make sense to go a little bit further than "good enough" on splitting service boundaries and ensuring clean queues and message passing infrastructure, because you have seasonal spikes where you need to scale up to a ton of load and then scale down after without constantly doing a song and dance or pre-provisioning fragile infrastructure.
But the most common thread there is - art, not a science. Every single decision depends on YOUR team, YOUR business, YOUR needs - and like any art, there is no universal rule or discovery or best practice in the industry that will magically work for your needs without working through the details of whether it appropriately fits your situation or not.
So with that said - I can't really agree with you. At any place I've ever worked with a competent team, maintaining duplicate code is just not that hard and follows the same process for being dealt with. Build a robust test suite that encodes the actual differences and the shared structure. Pull out the pieces that have a good reason to be abstracted and redesign the pieces that encode the true differential structure in a way that is intuitive. Lather, rinse, repeat. It's always straightforward because it's known - by the time you are doing this process, you've had tons of repetitions and data on what is driving you to develop the abstraction, so when you make the decision, you are making it empirically.
Conversely, I have seen many otherwise competent teams slowed to a halt by premature abstraction. Frameworks that were well intended and reduced duplication, but encoded coupling between components that, at a certain point in the business's progression, fought with reality rather than aided it, all because they were frozen into place before anyone really had clear empirical data about whether the abstraction would be worth it long term. Well-intended "clean code" refactors that were meant to solve the old "bad duplication" but instead created a far more difficult to reason about "abstracted base" of code that didn't really solve any of the domain modeling problems and was just as difficult to maintain without introducing buggy behaviors (if not more so) as before.
The biggest problem is that premature abstraction is sexy and fun. There are incentives and dopamine hits from doing it extraneously. But fixing legacy duplication is not fun. And so when it gets done, it tends to get done in a pragmatic way to relieve pain rather than to elicit pleasure. That, I believe is one of the biggest confounding sociological aspects of this whole discussion.
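To make the "test suite that encodes the actual differences and the shared structure" step above concrete, here is a minimal sketch. The two shipping-cost functions and their numbers are invented. The idea is only that the checks pin down what the duplicates share and how they differ before anyone extracts anything.

```python
# Two hypothetical duplicates that have drifted apart (amounts in cents).
def domestic_cost(weight_kg: int) -> int:
    return 500 + 120 * weight_kg


def international_cost(weight_kg: int) -> int:
    return 500 + 120 * weight_kg + 1000  # made-up customs surcharge


def check_shared_structure() -> None:
    # Both copies charge the same per-kilogram rate.
    for cost in (domestic_cost, international_cost):
        assert cost(3) - cost(2) == 120


def check_known_difference() -> None:
    # The only intended difference is a flat surcharge.
    for weight in range(5):
        assert international_cost(weight) - domestic_cost(weight) == 1000


check_shared_structure()
check_known_difference()
```

Once checks like these pass against the duplicated code, an extraction can be attempted and rerun against the same checks.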
But in one of the scenarios I mention earlier, I earned a chunk of money once fixing an issue that emerged in a subcontractor's four or five line duplication that had ended up rippled through a long-lived codebase. A ground truth (MySQL version) changed, and the pattern broke everywhere, including places where it had evolved.
So I tend towards thinking, yes, any three-line pattern that is likely to appear everywhere should, perhaps, be centralised.
It's certainly worthy of serious consideration. Usually pretty easy to maintain the surface of such an abstraction.
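A hedged illustration of centralising a small repeated pattern. The original story involved MySQL and a query pattern that isn't shown, so this uses Python's built-in `sqlite3` and an invented "fetch a single value" helper purely to keep the sketch self-contained.

```python
import sqlite3


def fetch_one_value(conn, sql, params=()):
    """Hypothetical helper: execute, take the first row, unwrap its first column.

    If the database or driver changes behaviour, only this body needs fixing,
    and every call site can be found by searching for the function name.
    """
    row = conn.execute(sql, params).fetchone()
    return None if row is None else row[0]


# Tiny in-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Ada",))

count = fetch_one_value(conn, "SELECT COUNT(*) FROM customers")
missing = fetch_one_value(conn, "SELECT name FROM customers WHERE id = ?", (99,))
```

The surface here is one name and one signature, which is the part that tends to stay stable even when the body has to change.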