I wondered why no one is trying to solve this problem. It looks like it's just a matter of time.
That said, my experience may be very skewed, since prepbook is a passion project running on a VPS with essentially zero scale. All I care about is the UX of the stack, not scale. Just for context.
What exactly were you struggling with when it came to the setup? Was it just a ton of new concepts that took time to learn, or something specific to Grafana/Prometheus/Loki?
"Getting it ready for production" is a different game.
I've fallen on my sword many times trying to explain that Prometheus fails every measure of production readiness; in fact, Google themselves replaced Borgmon (the system Prometheus was modeled on) with Monarch because "tiny unreliable time series databases everywhere" was not, in fact, the successful and reliable deployment strategy they had claimed.
But it is very easy to set up. Just don't go looking for failure modes, because they're everywhere and every single one of them is catastrophic.
See this PR for example (https://github.com/prometheus/prometheus/pull/18364) - this used to impact a production deployment I worked on. Prometheus, Thanos and even OpenTelemetry are full of those kinds of problems - but at the same time they're the best we have, and we should be grateful they're free and open source.
I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution - we've all seen what happens with Datadog & co.
Production? As soon as you scale, you need a proper solution. Prometheus (by itself) doesn't scale - you need Mimir or Thanos (or similar).