I built something similar that takes a screenshot and uses a multi-modal LLM to evaluate it against a design mock. It catches a completely different class of error. The DOM can be structurally perfect and still look nothing like what was intended. Colors wrong, layout shifted, spacing off, components overlapping. No amount of DOM assertions will catch that.
These are two different kinds of gates: structural gates, which are fast and deterministic, and stochastic gates, which are slow but catch a completely different class of problem. There is very little overlap between the issues they find, and you want to catch both.
That way I can invest a lot of time getting the mock just right, then let the agents "make it so".
The critical part is that, viewed at a high level, this method tests something different, which means it catches different errors.
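A minimal sketch of how the two kinds of gates described above could be chained, so the slow visual check only runs once the cheap structural checks pass. Everything here is hypothetical and for illustration only: `run_gates`, `dom_gate`, `visual_gate`, and the `judge` callable (a stand-in for whatever multi-modal model scores a screenshot against the mock) are not from any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


Gate = Callable[[], GateResult]


def dom_gate(html: str, required_id: str) -> Gate:
    """Structural gate: fast and deterministic (here, a naive id lookup)."""
    def check() -> GateResult:
        ok = f'id="{required_id}"' in html
        return GateResult(f"dom:{required_id}", ok, "" if ok else "element missing")
    return check


def visual_gate(
    screenshot: bytes,
    mock: bytes,
    judge: Callable[[bytes, bytes], float],  # assumed: returns a 0..1 similarity score
    threshold: float = 0.8,
) -> Gate:
    """Stochastic gate: slow, delegates to a model-backed judge."""
    def check() -> GateResult:
        score = judge(screenshot, mock)
        return GateResult("visual:mock", score >= threshold, f"score={score:.2f}")
    return check


def run_gates(structural: List[Gate], visual: List[Gate]) -> List[GateResult]:
    """Run structural gates first; skip the visual gates if any of them fail."""
    results = [gate() for gate in structural]
    if not all(r.passed for r in results):
        return results
    results.extend(gate() for gate in visual)
    return results
```

Running the structural gates first is only a cost optimization. Since the two kinds of gates catch different errors, a green structural run says nothing about whether the visual gate will pass.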
If coding agents are given Playwright access, they can actually do it better: using the Chrome DevTools Protocol they can interact with the browser and experiment without having to wait for all of this to complete before making moves. For instance, I've seen Claude Code capture console messages from a running Chrome instance and use them to debug things...
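For a sense of what "capturing console messages" looks like at the protocol level, here is a rough sketch that pulls console output from a stream of Chrome DevTools Protocol events. It assumes the events have already been received as JSON strings (the WebSocket connection to Chrome is left out), and `console_messages` is a made-up helper name. This is not a claim about how Claude Code or Playwright do it internally.

```python
import json
from typing import Iterable, List, Tuple


def console_messages(raw_events: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (level, text) pairs for every Runtime.consoleAPICalled event."""
    messages = []
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("method") != "Runtime.consoleAPICalled":
            continue  # ignore network, page, and other protocol events
        params = event.get("params", {})
        parts = []
        for arg in params.get("args", []):
            # Primitive arguments carry a "value"; objects usually only
            # have a "description", so fall back to that, then to the type.
            if "value" in arg:
                parts.append(str(arg["value"]))
            else:
                parts.append(arg.get("description", arg.get("type", "?")))
        messages.append((params.get("type", "log"), " ".join(parts)))
    return messages
```

An agent can then filter for the `error` level and feed those lines back into its next debugging step.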
At work, we've integrated Claude Code with GitLab issues/merge requests, and we get it to screenshot anything it's done. We could use the same workflow to screenshot (or in this case, host a proofshot bundle of) _any_ open PR. You would just get the agent to check out any PR, get proofshot to play around with it, then add the result as a comment. So not automated code reviews, which are tiresome, but more like a helpful comment with more context.
Going to try out proofshot this week, if it works like it does on the landing page it looks great.
This is sick, OP. Based on what's in the document, it looks really useful when you need to quickly fix something and validate that nothing has changed in the UI/workflow except what you asked for.
Also looks useful for PRs: have a before and after of what changed.
A few days ago I had an interaction with Codex that roughly went as follows: "this chat window is scrolling off screen, fix", "I've fixed it", "No you didn't", "You are totally right, I'm fixing it now", "still broken", "please use a headless browser to look at the thing and then fix it", "....", "I see the problem now, I'm implementing a fix and verifying the fix with the browser", etc. This took a few tries and it eventually nailed it. And added the e2e test, of course.
I usually prompt Codex with screenshots for layout issues as well. One of the nice things about their desktop app relative to the CLI is that pasting screenshots works.
A lot of our QA practices are still rooted in us checking stuff manually. We need to get ourselves out of the loop as much as possible. Tools like this make that easier.
I think I recall Mozilla pioneering regression testing of their layout engine using screenshots about a quarter century ago. They had a lot of stuff landing in their browser that could trigger all sorts of weird regressions. If screenshots changed without good reason, that was a bug. Very simple mechanism and very effective. We can do better these days.
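The screenshot-regression idea is simple enough to sketch in a few lines. This is a toy version, not Mozilla's actual implementation: it assumes both screenshots have already been decoded into flat lists of RGB tuples of the same size, and the `tolerance` and `max_changed` knobs are illustrative additions.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def changed_fraction(
    baseline: Sequence[Pixel],
    current: Sequence[Pixel],
    tolerance: int = 0,
) -> float:
    """Fraction of pixels whose channels differ by more than `tolerance`."""
    if len(baseline) != len(current):
        raise ValueError("screenshots differ in size")
    if not baseline:
        return 0.0
    changed = sum(
        1
        for old, new in zip(baseline, current)
        if any(abs(a - b) > tolerance for a, b in zip(old, new))
    )
    return changed / len(baseline)


def is_regression(
    baseline: Sequence[Pixel],
    current: Sequence[Pixel],
    max_changed: float = 0.0,
    tolerance: int = 0,
) -> bool:
    """Flag the screenshot if more than `max_changed` of the pixels moved."""
    return changed_fraction(baseline, current, tolerance) > max_changed
```

With the defaults, any single changed pixel counts as a regression, which matches the "changed without good reason is a bug" rule. A nonzero `tolerance` is one way to absorb anti-aliasing noise.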
An added benefit is that when Claude navigates and finds a bug, it will either add it to a list for human review or fix it automatically.
Pretty much a loop where building and debugging work together;-)
Once Claude Code
https://github.com/ChromeDevTools/chrome-devtools-mcp/pull/1...
I've only used it a bit, but it's working well so far.
I don't think you need either, though, because agent-browser itself has a skill for this: https://github.com/vercel-labs/agent-browser/blob/main/skill...
Maybe the author would like to compare the three.
I give the agent either a simple browser or Playwright access to proper browsers to do this. It works quite well, to the point where I can ask Claude to debug GLSL shaders running in WebGL with it.
It's not perfect though. I've personally found CC's VL to be worse than others such as Gemini, but it's nice to have it completely self-contained.
This project desperately needs a "What does this do differently?" section because automated LLM browser screenshot diffing has been a thing for a while now.
So... bypassing the whole "sees what it actually looks like in the browser. It can’t tell if the layout is broken" point the parent commenter is talking about? Seems worse, not better.
All the power to you if you build a product out of this. I don't wanna be that guy who says Dropbox is dead because you can just set up FTP. But with Codex/Claude Code, I was able to achieve this very result just from prompting.
I'd love to see an agent doing work, then launching the app on an iOS sim or Android emu to visually "use" the app and inspect whether things work as expected or not.
That's very different from scripting together what is effectively a whitebox test against document IDs, which is what people do with things like Playwright. Replacing manual QA like that could be valuable.
Can anyone recommend a browser-based instant preview site for web UI design with a more artistic/experimental bent?
I built something similar[0] a few months ago but haven't maintained it because Codex UI and Cursor have _reasonable_ tooling for this themselves now IMO.
That said there is still a way to go, and space for something with more comprehensive interactivity + comparison.
[0] - https://magiceyes.dev/
My Claude drives its own Brave autonomously, even for UI?
But it's great to see some other open source alternatives in this space as well.
From the OP, I don't think this is meant for what you are describing.
Tools like Claude and the like can, and do. This is just a utility to make the process easier.