upvote
to be entirely fair that's quite a high bar even for most "traditional" open source.

And even if you had the same data, there's no guarantee the random perturbations during training are driven by a PRNG and done in a way that is reproducible.

Reproducibility does not make something open source. Reproducibility doesn't even necessarily make something free software (under the GNU interpretation). I mean hell, most docker containers aren't even hash-reproducible.

reply
Yes, this is true. A lot of times labs will hold back necessary infrastructure pieces that allow them to train huge models reliably and on a practical time scale. For example, many have custom alternatives to Nvidia’s NCCL library to do fast distributed matrix math.

Deepseek published a lot of their work in this area earlier this year and as a result the barrier isn’t as high as it used to be.

reply