To be fair, I was surprised too. But I did a relatively simple, straight port of the AMD rays SDK, plus some input from the pbrt-v4 CPU BVH code, and it just worked relatively well out of the box...
This is the main intersection function, which is quite simple:
https://github.com/JuliaGeometry/Raycore.jl/blob/sd/multityp...
I'm not even using local memory, since it was already fast enough ;)
But I think we can still do quite a lot: large parts of the construction code are still very messy, and I want to polish and modularize the code over time.
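
For anyone who hasn't looked at this kind of code before, here is a rough sketch of what a stack-based BVH closest-hit loop generally looks like. It is not the Raycore.jl function linked above and not the AMD or pbrt-v4 code. The `Node` layout, the names `hit_aabb` and `closest_hit`, the stack size of 64, and the use of leaf boxes as stand-ins for real triangle tests are all assumptions made up for illustration, and it is written in Python just to keep it short and runnable.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    # Hypothetical node layout, chosen for this sketch only.
    lo: Vec3          # AABB min corner
    hi: Vec3          # AABB max corner
    left: int = -1    # child indices; -1 marks a leaf
    right: int = -1
    prim: int = -1    # primitive id stored in a leaf

def hit_aabb(orig: Vec3, inv_dir: Vec3, lo: Vec3, hi: Vec3,
             t_max: float) -> Optional[float]:
    """Slab test. Returns the entry distance, or None on a miss."""
    t0, t1 = 0.0, t_max
    for a in range(3):
        ta = (lo[a] - orig[a]) * inv_dir[a]
        tb = (hi[a] - orig[a]) * inv_dir[a]
        if ta > tb:
            ta, tb = tb, ta
        t0 = max(t0, ta)
        t1 = min(t1, tb)
        if t0 > t1:
            return None
    return t0

def closest_hit(nodes: List[Node], orig: Vec3, direction: Vec3,
                t_max: float = float("inf")) -> Tuple[int, float]:
    """Returns (primitive id, distance); the id is -1 if nothing was hit."""
    inv_dir = tuple(1.0 / d if d != 0.0 else float("inf") for d in direction)
    # Fixed-size stack, standing in for a small per-thread array on a GPU.
    stack = [0] * 64
    sp = 0
    stack[sp] = 0  # push the root node
    sp += 1
    best_t, best_prim = t_max, -1
    while sp > 0:
        sp -= 1
        node = nodes[stack[sp]]
        # Passing best_t as the limit culls subtrees behind the current hit.
        t = hit_aabb(orig, inv_dir, node.lo, node.hi, best_t)
        if t is None:
            continue
        if node.left < 0:
            # Leaf: a real tracer would run the triangle test here.
            if t < best_t:
                best_t, best_prim = t, node.prim
        else:
            stack[sp] = node.left
            sp += 1
            stack[sp] = node.right
            sp += 1
    return best_prim, best_t
```

The loop only needs the node array and a small integer stack, which is a big part of why this kind of traversal ports to a GPU kernel without much ceremony. A real implementation would typically also visit the nearer child first so that more of the far subtree gets culled.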