Also, if we're being nitpicky, diffusion model inference has been proven equivalent to (and is often used as) a particular NF so.. shrug
The appendix goes on to explain, "We apply simple volume-preserving normalizing flows of the form z′ = z + b(z) to the samples generated by the encoder at each level".