upvote
I wonder who left the team recently. Must have been someone loaded up with shadow knowledge. Or maybe they're shipping the devops/dev work off to another continent.
reply
They're in the process of moving from "legacy" infra to Azure, so there's a ton of churn happening behind the scenes. That's probably why things keep exploding.
reply
I don't know jack about shit here, but genuinely: why migrate a live production system piecewise? Wouldn't it be far more sane to start building a shadow copy on Azure and let that blow up in isolation while real users keep using the real service on """legacy""" systems that still work?
reply
Because it's significantly harder to isolate problems, and you'll end up in this loop:

* Deploy everything
* It explodes
* Roll back everything
* Spend two weeks finding the problem in one system, then fix it
* Deploy everything
* It explodes
* Roll back everything
* Spend two weeks finding a new problem that was created while you were fixing the last one
* Repeat ad nauseam

Migrating iteratively gives you a foundation to build upon with each component.

reply
So… create your shadow system piecewise? There is no reason to have "explode production" in your workflow, unless you are truly starved for resources.
reply
Does this shadow system have usage?

Does it handle queries, trigger CI actions, run jobs?

reply
If you test it, yes.

Of course, you need some way of producing test loads similar to those found in production. One way would be to take a snapshot of production, tap incoming requests for a few weeks, log everything, then replay it at "as fast as we can" speed against the test system; another way would be to mirror production live, running the same operations in test as in production.
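
Rough sketch of the replay half, assuming requests were captured as JSON lines with the status code production returned recorded next to each one (the log format, field names, and shadow URL are all made up for illustration):

    import json
    import requests  # third-party: pip install requests

    SHADOW_BASE = "https://shadow.example.internal"  # hypothetical shadow deployment

    def replay(log_path):
        # Replay captured production traffic against the shadow system at
        # "as fast as we can" speed (no attempt to preserve original timing).
        with open(log_path) as f:
            for line in f:
                entry = json.loads(line)  # e.g. {"method": "GET", "path": "/x", "prod_status": 200}
                resp = requests.request(
                    entry["method"],
                    SHADOW_BASE + entry["path"],
                    json=entry.get("body"),
                    timeout=10,
                )
                # Flag any divergence from what production answered at capture time.
                if resp.status_code != entry.get("prod_status"):
                    print("mismatch on", entry["method"], entry["path"],
                          "shadow:", resp.status_code, "prod:", entry.get("prod_status"))

    if __name__ == "__main__":
        replay("captured_requests.jsonl")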

Alternatively, you could take the "chaos monkey" approach (https://www.folklore.org/Monkey_Lives.html), do away with all notions of realism, and just fuzz the heck out of your test system. I'd go with that first, because it's easy and tends to catch the more obvious bugs.
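
The fuzzing version is even less work; zero realism, just junk requests until something 500s or falls over (the endpoints and URL here are invented):

    import random
    import string
    import requests  # third-party: pip install requests

    SHADOW_BASE = "https://shadow.example.internal"  # hypothetical shadow deployment
    PATH_TEMPLATES = ["/repos/{0}/{1}", "/users/{0}", "/search?q={0}"]  # made-up endpoints

    def junk(n=12):
        return "".join(random.choices(string.ascii_letters + string.digits + "%./-", k=n))

    def fuzz(iterations=10000):
        # Hammer random endpoints with garbage and flag anything that errors out or hangs.
        for _ in range(iterations):
            path = random.choice(PATH_TEMPLATES).format(junk(), junk())
            try:
                resp = requests.get(SHADOW_BASE + path, timeout=5)
                if resp.status_code >= 500:
                    print("server error", resp.status_code, "on", path)
            except requests.RequestException as exc:
                print("request blew up on", path, ":", exc)

    if __name__ == "__main__":
        fuzz()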

reply
So just double your cloud bill for several weeks, costing a site like GitHub millions of dollars?

How do you handle duplicate requests to external services? Are you going to run credit cards twice? Send emails twice? If not, how do you know it's working with fidelity?
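
The usual dodge is to swap the side-effecting clients for no-op stubs in the shadow environment, something like the sketch below (all names hypothetical), and that is exactly where the fidelity question bites:

    import os

    class RealEmailSender:
        def send(self, to, subject, body):
            ...  # talks to the actual mail provider

    class NoOpEmailSender:
        # Shadow-environment stand-in: records the call but sends nothing,
        # so replayed/mirrored traffic can't email customers twice.
        def __init__(self):
            self.sent = []

        def send(self, to, subject, body):
            self.sent.append((to, subject, body))

    # hypothetical flag set only in the shadow deployment
    email_sender = NoOpEmailSender() if os.environ.get("SHADOW_ENV") else RealEmailSender()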

reply
Why would you avoid a perfect opportunity to test a bunch of stuff on your customers?
reply
If you make it work, migrating piecewise should mean less change/risk at each juncture than one big jump of everything at once from here to there.

But you need to have pieces that are independent enough to run some here and some there, and ideally pieces that can fail without taking down the whole system.
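
As a toy illustration of "run some here and some there": a routing rule that deterministically sends a fixed slice of one already-migrated component's traffic to the new infra and everything else to the old one (component names, URLs, and percentages are all invented):

    import hashlib

    LEGACY_BASE = "https://legacy.example.internal"  # hypothetical old infra
    NEW_BASE = "https://azure.example.internal"      # hypothetical new infra
    MIGRATED = {"ci-runner": 10}  # component -> % of its traffic sent to the new infra

    def backend_for(component, request_id):
        # Hash the request id so the same request always lands on the same side,
        # and only the configured percentage of a component's traffic moves over.
        pct = MIGRATED.get(component, 0)
        bucket = int(hashlib.sha1(request_id.encode()).hexdigest(), 16) % 100
        return NEW_BASE if bucket < pct else LEGACY_BASE

    # e.g. backend_for("ci-runner", "req-123") returns one of the two base URLs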

reply
That’s a safer approach, but it means teams need to test in two infrastructures (old world and new) until the entire new environment is ready for prime time. They’re hopefully moving fast and definitely breaking things.
reply
A few reasons:

1. Stateful systems (databases, message brokers) are hard to switch back-and-forth; you often want to migrate each one as few times as possible.

2. If something goes sideways -- especially performance-wise -- it can be hard to pin down the cause if everything changed at once.

3. It takes a long time (months/years) to complete the migration. By doing it incrementally, you can reap the advantages of the new infra as you go and avoid maintaining two complete copies of everything for the whole duration.

---

All that said, GitHub is doing something wrong.

reply
It took me a second to realize this wasn't sarcasm.
reply
Are they just going to tough it out through the process and whatever...
reply
I think it's more likely the introduction of the ability to say "fix this for me" to your LLM + "lgtm" PR reviews. That or MS doing their usual thing to acquired products.
reply
Rumors I've heard are that GitHub is mostly run by contractors? That might explain the chaos more than simple vibe coding (which probably aggravates this).
reply
nah, they're just showing us how to vibecode your way to success
reply
If the $$$ they save > the $$$ they lose, then yeah, it is a success. Business only cares about $$$.
reply
Definitely. The devil is in the details, though, since it's so damn hard to quantify the $$$ lost when you have a large, opinionated customer base that holds tremendous grudges. Doubly so when it's a subscription service with effectively unlimited lifetime for happy accounts.

Business by spreadsheet is super hard for this reason - if you try to charge the maximum you can before people get angry and leave, then you're one tiny outage/issue/controversy/breach away from tipping over the wrong side of that line.

reply
Yeah, but who cares about the long term? In the long run we are all dead. A CEO only needs to look good for 5-10 years max, pump up the stock price, get applause everywhere, and be called the smartest guy in the world.
reply
I think the last major outage wasn't even two weeks ago. We've got about another two weeks to finish our MVP and get it launched, and... this really isn't helpful. I'm getting pretty fed up with the unreliability.
reply
Sure, it's not vibe coding related.
reply