upvote
The intersection of uptime across every possible service they offer isn't a particularly great metric. I get the point that they are doing badly, but it makes it look worse than I think it really is.

What I would like to see is a combined uptime for "code services", basically Git+Webhooks+API+Issues+PRs, which corresponds to a set of user workflows that really should be their bread & butter, without highlighting things you might not care about (Codespaces, Copilot).

reply
Depends how integrated those features are.

A service's availability is capped by its critical dependencies; this is textbook SRE stuff (see Treynor et al., The Calculus of Service Availability). Copilot may well be on the side of it (and has the worst uptime, dragging everything down), but if Actions depends on Packages then Actions can be "up" while in reality the service is not functional. If your release pipeline depends on Webhooks, then you're unable to release.

The obvious one is git operations: if you don't have git ops then basically everything is down.

So; you're right about Copilot, but the subset you proposed (Git+Webhooks+API+Issues+PRs) has the exact same intersection problem. If git is at one nine, that entire subset is capped at one nine too, no matter how green the rest of it looks.

And to be clear: git operations is sitting at 98.98% on the reconstructed dashboard linked above[1]. That is one nine. Github stopped publishing aggregate numbers on their own status page, which.. tells you something.

[1]: https://mrshu.github.io/github-statuses/

reply
Well yes you could do that on a status page, but it's basically just lying to put Actions as green if it's actually down because it depends on Packages which is red.

With that set, I wasn't proposing a set of totally independent services to be grouped together, I was talking about a set of things that I think represent pretty core services for Github users. If Git is dragging the rest of those down, fine; PRs are useless without it. In fact it is worse than some but it's not the worst of that group, and it is still a lot better then the dregs of Actions and Copilot.

Having said that, the numbers are of course terrible, two nines on a couple of things and one on everything else would be bad for a startup, it's an utter embarrassment for a company that's been doing this over a decade.

reply
also I never had considered that breaking your up-time into a bunch of different components is just a strategy to make your SRE look better than it actually is. The combined up-time tells the real story (88%!). Thanks for the link
reply
The number of nines assigned to a suite of services is not indicative of the quality of SRE at any given company, but rather a reflection of the tradeoffs a business has decided to make. Guaranteed there's a dashboard somewhere at Github looking at platform stickiness vs. reliability and deciding how hard to let teams push on various initiatives.
reply
this is fair. I should have just said "Site Reliability", as it's almost certainly out of the engineers control.
reply
ya i was just doing the math on their chart for the git operations. I added up 14.93 hours combined hours, which puts them WAY lower than the reported 99.7 metric they show right next to it.

So based on their own reporting, the uptime number should be 99.31. Which means only like 6 additional hours and they'd fall below 99.0%

reply