In this benchmark, models can correctly solve Rust problems 61% on first pass — A far cry from other languages such as C# (88%) or Elixir (a “buggy dynamic language”) where they perform best (97%).
I wonder why that is, it’s quite surprising. Obviously details of their benchmark design matter, but this study doesn’t support your claims.