
Very interested in this! Can you share more about the modelling method (e.g., three.js?), the task list, and the outputs here?

I think there's probably some good juice to squeeze in terms of spatial awareness with a benchmark along these lines:

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".
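
A rough sketch of that loop, in case it helps make the idea concrete. Everything here is an assumption for illustration: `build`, `render`, and `classify` are hypothetical callables standing in for the modelling agent, the renderer, and the third-party vision model, and grading by exact label match after a majority vote across views is just one simple choice.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def end_to_end_accuracy(
    tasks: Sequence[Tuple[str, str]],
    build: Callable[[str], object],
    render: Callable[[object, float], bytes],
    classify: Callable[[bytes], str],
    angles: Sequence[float] = (0.0, 90.0, 180.0, 270.0),
) -> float:
    """Return the fraction of tasks whose renders are identified correctly.

    tasks    -- (prompt, expected_label) pairs
    build    -- hypothetical: prompt -> 3D model
    render   -- hypothetical: (model, camera angle in degrees) -> image bytes
    classify -- hypothetical: image bytes -> answer to "what is this?"
    """
    if not tasks:
        raise ValueError("need at least one task")
    correct = 0
    for prompt, expected in tasks:
        model = build(prompt)
        # Ask the vision model about every snapshot, then normalise the answers.
        answers = [classify(render(model, angle)).strip().lower() for angle in angles]
        # Majority vote across views; ties go to the answer seen first.
        top_answer, _ = Counter(answers).most_common(1)[0]
        if top_answer == expected.strip().lower():
            correct += 1
    return correct / len(tasks)
```

Exact string matching is crude; a real harness would probably need fuzzier matching or a fixed answer vocabulary.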

by ponyous 4 hours ago

The eval results aren't live yet, so I can't share them.

I was benchmarking using a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent with access to tools that can execute build123d scripts and get sculpted models, Blender to combine sculpts with parametric models, tools to inspect the model (visually and with code), tools to search datasheets, ...

I tried what you're recommending a while ago (asking an AI to evaluate renders from different angles), and the AI evaluations were extremely bad: barely any correlation with my own scores. Things have gotten better, but I don't trust it enough yet.

Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just return a boolean "pass" or not):
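
For anyone wanting to check the same thing on their own evals, here is a minimal sketch of one way to quantify agreement between human and AI scores. Using Pearson correlation is an assumption on my part; the function name and argument names are hypothetical, and a rank correlation may suit ordinal scores better.

```python
import math
from typing import Sequence


def pearson(human: Sequence[float], ai: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(human) != len(ai) or len(human) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_h = sum(human) / len(human)
    mean_a = sum(ai) / len(ai)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((h - mean_h) * (a - mean_a) for h, a in zip(human, ai))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_a = sum((a - mean_a) ** 2 for a in ai)
    if var_h == 0 or var_a == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_h * var_a)
```

A value near 1 means the AI grader ranks scenarios much like the human does; a value near 0 is the "barely any correlation" case.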

    <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
    <0.4 → Weak – Partially relevant; significant omissions or errors.
    <0.6 → Fair – Covers main points but lacks completeness or precision.
    <0.8 → Good – Mostly accurate; minor gaps or deviations.
    <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
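
As a small illustration, the bands above could be mapped to labels like this. The function name is hypothetical, and rejecting scores outside 0 to 1 is an assumption, since the rubric does not say how those are handled.

```python
def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Upper bounds are exclusive, matching the "<" thresholds in the rubric.
    for upper, label in ((0.2, "Poor"), (0.4, "Weak"), (0.6, "Fair"), (0.8, "Good")):
        if score < upper:
            return label
    return "Excellent"
```

Note that a score of exactly 0.8 lands in "Excellent" under this reading, because each lower band uses a strict "<".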

Here is the scenario list (prompts are much more detailed):

    dragon-bottle-stopper
    editing-param-mid-conv
    editing-parametric-enclosure
    editing-swap-material-param
    editing-text-edit-cube
    multi-turn-bird-house
    multi-turn-dice-tower
    multi-turn-modular-planter
    multi-turn-phone-stand
    multi-turn-shelf
    one-shot-bookend
    one-shot-cable-clip
    one-shot-chess-queen
    one-shot-coaster
    one-shot-coffee-cup
    one-shot-dog-tag
    one-shot-dragon-figurine
    one-shot-hex-bracket
    one-shot-keychain-fob
    one-shot-low-poly-tree
    one-shot-pegboard-hook
    one-shot-pi4-case
    one-shot-threaded-jar

[0]: https://grandpacad.com

by NiloCK 3 hours ago

Very cool project. Thanks for sharing!

by ComputerGuru 4 hours ago

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

by ponyous 4 hours ago