On the parameters: the relational tests use 5 million records per test. The exceptions are the key-value category, which uses 15 million records, and the embedded category, which uses 1 million. The same dataset shape, workload, harness, and hardware are used for every engine being compared.
For WAL, the 2 to 16 GB range is not meant as a limit derived from the dataset size. In the published runs the dataset is small enough that WAL sizing should not be a bottleneck. The persistent runs are also full-durability runs, with Postgres running with fsync and synchronous_commit enabled.
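To make those numbers easier to check in one place, here is a rough Python sketch of the parameters described above. It is not taken from the harness itself. The category labels and the `render_conf` helper are made up for illustration, and mapping the 2 to 16 GB WAL range onto Postgres's `min_wal_size` and `max_wal_size` settings is an assumption.

```python
# Records per test, as stated above. The category labels are
# illustrative names, not identifiers from the benchmark source.
RECORDS_PER_TEST = {
    "relational": 5_000_000,
    "key-value": 15_000_000,
    "embedded": 1_000_000,
}

# Durability-related Postgres settings for the persistent runs.
# Assumption: the 2 to 16 GB WAL range corresponds to
# min_wal_size / max_wal_size; the harness may express it differently.
POSTGRES_DURABILITY = {
    "fsync": "on",
    "synchronous_commit": "on",
    "min_wal_size": "2GB",
    "max_wal_size": "16GB",
}


def render_conf(settings):
    """Render a settings dict as postgresql.conf-style 'name = value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


if __name__ == "__main__":
    for category, count in RECORDS_PER_TEST.items():
        print(f"{category}: {count:,} records per test")
    print(render_conf(POSTGRES_DURABILITY))
```

Printing the rendered settings next to the record counts gives a quick way to compare a local setup against what is described here.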
We will update the benchmarks page so the versions, dataset sizes, and tuning details are easier to find without digging through the Rust source.
Full transparency would be very helpful for understanding where these strengths come from. At a glance, they look to come from multi-threaded in-memory processing.