[-]

I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.

by tasuki 2 hours ago|

prev|

[-]

Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...

by Manabu-eo 6 hours ago|

prev|

[-]

How likely is it that this problem is already in the training set by now?

by simonw 5 hours ago|

parent|

[-]

If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.

by suddenlybananas 4 hours ago|

parent|

[-]

Why would they train on that? Why not just hire someone to make a few examples?

by simonw 4 hours ago|

parent|

[-]

I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.

by suddenlybananas 4 hours ago|

parent|

[-]

But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.

by simonw 4 hours ago|

parent|

[-]

The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.

by suddenlybananas 4 hours ago|

parent|

[-]

When you're spending trillions on capex, paying a couple of people to make some doodles in SVG would not be a big expense.

by simonw 3 hours ago|

parent|

[-]

The embarrassment of getting caught doing that would be expensive.

by throwup238 5 hours ago|

parent|

prev|

[-]

For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

by recursive 5 hours ago|

parent|

[-]

No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.

by svara 5 hours ago|

parent|

[-]

More likely you would just train for emitting svg for some description of a scene and create training data from raster images.

by recursive 27 minutes ago|

parent|

[-]

None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly.

by zarzavat 5 hours ago|

parent|

prev|

[-]

You can always ask for a tyrannosaurus driving a tank.

by verdverm 5 hours ago|

parent|

prev|

[-]

I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.

by 6 hours ago|

parent|

prev|

[-]

deleted

by enraged_camel 4 hours ago|

prev|

[-]

Is there a list of these for each model, that you've catalogued somewhere?

by throwup238 6 hours ago|

prev|

[-]

The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)

by margalabargala 5 hours ago|

parent|

[-]

It's not actually, look up some photos of the sun setting over the ocean. Here's an example:

https://stockcake.com/i/sunset-over-ocean_1317824_81961

by throwup238 4 hours ago|

parent|

[-]

That’s only if the sun is above the horizon entirely.

by margalabargala 4 hours ago|

parent|

[-]

No, it's not.

https://stockcake.com/i/serene-ocean-sunset_1152191_440307

by throwup238 1 hour ago|

parent|

[-]

Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds.

by deron12 6 hours ago|

prev|

[-]

It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

by gs17 5 hours ago|

parent|

[-]

It depends. If you meant a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.

by fvdessen 5 hours ago|

parent|

[-]

Maybe you're a pro vector artist, but I couldn't create such a cool one myself in Illustrator, tbh.

by dfdsf2 5 hours ago|

parent|

prev|

[-]

Indeed. And when you factor in the amount invested... yeah, it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance, but in any instance, e.g. a seahorse on a bike.

by saberience 5 hours ago|

prev|

[-]

Do you have to still keep trying to bang on about this relentlessly?

It was sort of humorous for maybe the first 2 iterations; now it's tacky, cheesy, and just relentless self-promotion.

Again, like I said before, it's also a terrible benchmark.

by jeanloolz 3 hours ago|

parent|

[-]

I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment, however, is a little harsh. Why mad?

by Davidzheng 5 hours ago|

parent|

prev|

[-]

Eh, I find it more of a not-very-informative but lighthearted commentary.

by simonw 4 hours ago|

parent|

prev|

[-]

It being a terrible benchmark is the bit.

by dfdsf2 5 hours ago|

prev|

[-]

Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

by chriswarbo 5 hours ago|

parent|

[-]

I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.

In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.

I also think the prompt itself, of a pelican on a bicycle, is unrealistic and cartoonish, so making a cartoon is a good way to solve the task.

by peaseagee 5 hours ago|

parent|

prev|

[-]

The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.