I often wonder about a macro-like thing where we could write a function using a subset of the language that's SIMD-aware. A bit higher level than using intrinsics or those SIMD libs.
One example is Java, which will happily vectorize your code into AVX or SSE where possible.
Python just got a JIT compiler and we’ll start seeing the same thing soon.
But as someone else said here, some constructs don't translate well, and adding transformations to make them vectorizable would negate the performance gains.
Sad that the compiler (even Java's) can't explain this to you and warn about it, but now with LLMs, maybe they'll start doing things like that soon.
Zig has MultiArrayList in the stdlib which does the SoA transform via comptime:
https://ziglang.org/documentation/master/std/#std.multi_arra...
Zig also sorts struct members by size/alignment, but has two escape hatches ('extern struct' which is for C compatibility, and 'packed struct' which offers an explicit bit-by-bit memory layout).
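To make the padding effect of member ordering concrete, here is a rough sketch using Python's ctypes, which lays fields out in declaration order with C rules (roughly what 'extern struct' gives you). The struct and field names are made up for illustration, and the sort below is only a stand-in for what a compiler might do, not Zig's actual algorithm. Exact sizes depend on the platform ABI.

```python
import ctypes

class Declared(ctypes.Structure):
    # Fields in "as written" order: small, large, small.
    # C layout rules insert padding before and after the 8-byte field.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("count", ctypes.c_uint64),
        ("kind", ctypes.c_uint8),
    ]

def sorted_by_alignment(fields):
    # Largest alignment first, so the small fields pack together at the end.
    return sorted(fields, key=lambda f: ctypes.alignment(f[1]), reverse=True)

class Reordered(ctypes.Structure):
    # Same fields, reordered by alignment (illustrative only).
    _fields_ = sorted_by_alignment(Declared._fields_)

declared_size = ctypes.sizeof(Declared)
reordered_size = ctypes.sizeof(Reordered)
print("declared:", declared_size, "reordered:", reordered_size)
```

The reordered struct should never come out larger than the declared one, and on common 64-bit ABIs it is noticeably smaller.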
AFAIK Odin and Jai offer the SoA transform as specialized language features, e.g. in Odin:
https://odin-lang.org/docs/overview/#soa-data-types
I'd still always want such data layout transforms as an explicit language feature though, not the compiler making this decision for me.
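For anyone unfamiliar with the SoA transform being discussed, a minimal hand-written sketch in Python follows. The Particle and ParticleSoA names are hypothetical, and this is not how MultiArrayList or Odin's #soa are implemented; it only shows the shape of the layout change: one dense array per field instead of one array of records.

```python
from array import array
from dataclasses import dataclass

@dataclass
class Particle:
    # Array-of-structs element: all fields of one item stored together.
    x: float
    y: float
    mass: float

class ParticleSoA:
    """Struct-of-arrays container: one dense array of doubles per field."""

    def __init__(self):
        self.x = array("d")
        self.y = array("d")
        self.mass = array("d")

    def append(self, p):
        # Scatter one record across the per-field arrays.
        self.x.append(p.x)
        self.y.append(p.y)
        self.mass.append(p.mass)

    def __len__(self):
        return len(self.x)

    def get(self, i):
        # Gather the fields back into a single record.
        return Particle(self.x[i], self.y[i], self.mass[i])

def total_mass(soa):
    # Touches only the contiguous mass array, never x or y.
    return sum(soa.mass)

particles = ParticleSoA()
for p in [Particle(0.0, 0.0, 1.0), Particle(1.0, 2.0, 2.0), Particle(3.0, 4.0, 3.0)]:
    particles.append(p)
print(total_mass(particles))
```

The point of the explicit container is that the access pattern is visible in the code: a loop over one field reads one contiguous array.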
I wonder if Futhark does? E.g. https://futhark-lang.org/student-projects/pedersen-nelin-msc...
Out of this 1000x speedup you get 100x by just not using python though ;)
Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler won't be of much use when the output code only runs on a handful of devices.
They do - they just can't assume GFNI instructions are present unless you explicitly say so: https://godbolt.org/z/eYasbKsse
Because they are not query compilers, i.e. they don't know the data.
For example, a query compiler could swap an index lookup for a full scan because it "sees" (via runtime statistics) that the data doesn't benefit from the index.
On the other hand, an optimization here can be a pessimization there. So optimizers in general should be very conservative, because of butterfly effects!
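A toy sketch of the statistics-driven choice described above, in Python. Everything here is hypothetical (the ColumnStats type, the choose_plan function, and the cost constants); real planners use far richer cost models. It only shows why knowing the data changes the answer.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    # Hypothetical runtime statistics a planner might keep for a table.
    row_count: int       # total rows in the table
    matching_rows: int   # estimated rows satisfying the predicate

def choose_plan(stats, index_cost_per_row=4.0, scan_cost_per_row=1.0):
    """Pick the cheaper plan under a made-up linear cost model.

    An index lookup is assumed to cost more per row (random access)
    but only touches matching rows; a full scan touches every row.
    """
    index_cost = stats.matching_rows * index_cost_per_row
    scan_cost = stats.row_count * scan_cost_per_row
    return "index_lookup" if index_cost < scan_cost else "full_scan"

# Selective predicate: few rows match, so the index wins.
print(choose_plan(ColumnStats(row_count=1000, matching_rows=10)))
# Unselective predicate: most rows match, so the scan wins.
print(choose_plan(ColumnStats(row_count=1000, matching_rows=900)))
```

A general-purpose compiler has no equivalent of ColumnStats at compile time, which is the gap the comment is pointing at.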
a pragmatic approach: write in a high level interpreted language that rhymes with modern CPUs, vector extensions, memory bandwidth
e.g. apl [0], bqn [1], k [2], kiwi [3]
- vectors are dense (not boxed)
- optimized internal representation (e.g. bitpacked bool vectors)
- primitives act on vectors + use avx, neon if possible
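To illustrate the bitpacked bool vector idea from the list above, a small Python sketch that stores one bool per bit of an integer. The function names are made up, and this says nothing about how any of the listed array languages actually implement it. The element-wise AND becomes a single integer operation on the packed form.

```python
def pack_bools(values):
    # Store element i of the bool vector in bit i of one integer.
    bits = 0
    for i, v in enumerate(values):
        if v:
            bits |= 1 << i
    return bits

def unpack_bools(bits, length):
    # Expand the packed integer back into a list of bools.
    return [bool((bits >> i) & 1) for i in range(length)]

a = pack_bools([True, False, True, True])
b = pack_bools([True, True, False, True])

# One integer AND processes every element of the vector at once.
both = a & b
# Counting set bits gives the number of True elements.
count = bin(both).count("1")

print(unpack_bools(both, 4), count)
```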
[0] https://www.dyalog.com
[1] https://mlochbaum.github.io/BQN/
[2] https://kx.com
[3] https://kiwilang.com
great article by marshall on BQN performance compared to C and how to think about it
https://mlochbaum.github.io/BQN/implementation/versusc.html
related:
- columnar databases: kdb, duckdb, clickhouse
- machine learning frameworks: pytorch, keras, jax, mlx