undefined

points

[-]

It seems like this is an approach that trades off scale and performance for operational simplicity. They say they only have 1GB of records and they can use a single committer to handle all requests. Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?

This is not to say it's a bad system, but it's very precisely tailored for their needs. If you look at the original Kafka implementation, for instance, it was also very simple and targeted. As you bolt on more use cases and features you lose the simplicity to try and become all things to all people.

by danhhz6 hours ago|

parent|

[-]

(author here)

> It seems like this is an approach that trades off scale and performance for operational simplicity.

Yes, this is exactly it. Given that turbopuffer itself is built on the idea of object storage + stateless cache, we're all very comfortable dealing with it operationally. This design is enough for our needs and is much easier to be oncall for than adding an entirely new dependency would have been.

by packetlost5 hours ago|

parent|

[-]

IMO this is the ideal way to engineer most (not all) systems. As simple as your needs allow. Nice work!

by loevborg10 hours ago|

parent|

prev|

[-]

> Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?

Conceptually that makes sense. How complicated is it to implement this failover logic in a safe way? If there are two processes, competing for CAS wins, is there not a risk that both will think they're non-leaders and terminate themselves?

by Normal_gaussian8 hours ago|

parent|

[-]

The broker lifecycle is presumably

1. Start

2. Load the queue.json from the object store

3. Receive request(s)

3. Edit in memory JSON with batch data

4. Save data with CAS

5. On failure not due to CAS, recover (or fail)

6. On success, succeed requests and go to 3

7. On failure due to CAS, fail active requests and terminate

The client should have a retry mechanism against the broker (which may include looking up the address again).

From the brokers PoV, it will never fail a CAS until another broker wins a CAS, at which point that other broker is the leader. If it does fail a CAS the client will retry with another broker, which will probably be the leader. The key insight is that the broker reads the file once, it doesn't compete to become leader by re-reading the data and this is OK because of the nature of the data. You could also say that brokers are set up to consider themselves "maybe the leader" until they find out they are not, and losing leadership doesn't lose data.

The mechanism to start brokers is only vaguely discussed, but if a host-unreachable also triggers a new broker there is a neat from-zero scaling property.

by staticassertion9 hours ago|

parent|

prev|

[-]

This is the hardest part because you can easily end up in a situation like you're describing, or having large portions of clients talking to a server just to have their writes rejected.

Further, this system (as described) scales best when writes are colocated (since it maximizes throughput via buffering). So even just by having a second writer you cut your throughput in ~half if one of them is basically dead.

If you split things up you can just do "merge manifests on conflict" since different writers would be writing to different files and the manifest is just an index, or you can do multiple manifests + compaction. DeltaLake does the latter, so you end up with a bunch of `0000.json`, `0001.json` and to reconstruct the full index you read all of them. You still have conflicts on allocating the json file but that's it, no wasted flushing. And then you can merge as you please. This all gets very complex at this stage I think, compaction becomes the "one writer only" bit, but you can serve reads and writes without compaction.

https://doi.org/10.14778/3415478.3415560

Note that since this paper was published we have gotten S3 CAS.

Alternatively, I guess just do what Kafka does or something like that?

by formerly_proven13 hours ago|

prev|

[-]

Write amplification >9000 mostly

by snowhale8 hours ago|

prev|

[-]

[dead]