upvote
I think depends a lot on how much you value your own time, since its quite time consuming to write and update playwright scripts. It's gonna save you developer hours to write automations using natural language rather than messing around with and fixing selectors. It's also able to handle tasks that playwright wouldn't be able to do at all - like extracting structured data from a messy/ambiguous DOM and adapting automatically to changing situations.

You can also use cheaper models depending on your needs, for example Qwen 2.5 VL 72B is pretty affordable and works pretty well for most situations.

reply
But we can use an LLM to write that script though and give that agent access to a browser to find DOM selectors etc. And than we have a stable script where we, if needed, manually can fix any LLM bugs just once…? I’m sure there are use cases with messy selectors as you say, but for me it feels like most cases are better covered by generating scripts.
reply
Yeah we've though about this approach a lot - but the problem is if your final program is a brittle script, you're gonna need a way to fix it again often - and then you're still depending on recurrently using LLMs/agents. So we think its better to have the program itself be resilient to change instead of you/your LLM assistant having to constantly ensure the program is working.
reply
I wonder if a nice middle ground would be: - recording the playwright behind the scenes and storing - trying that as a “happy path” first attempt to see if it passes - if it doesn’t pass, rebuilding it with the AI and vision models

Best of both worlds. The playwright is more of a cache than a test

reply
I think the difficulty with this approach is (1) you want a good "lookup" mechanism - given a task, how do you know what cache should be loaded? you can do a simple string lookup based on the task content, but when the task might include parameters or data, or be a part of a bigger workflow, it gets trickier. (2) you need a good way to detect when to adapt / fall back to the LLM. When the cache is only a playwright script, it can be difficult to know when it falls out of the existing trajectory. You can check for selector timeouts and things, but you might be missing a lot of false negatives.
reply
Are you sure? Couldnt you just just go back to the LLM if the script breaks? Pages changes but not that often in general.

It seems like a hybrid approach would scale better and be significantly cheaper.

reply
We do believe in a hybrid approach where a fast/deterministic representation is saved - but think there is a more seamless way were the framework itself is high level and manages these details by caching the underlying actions that can run
reply
I think you are overstating. Just use Playwright codegen. No need for manual test writing, or at least 90% can get generated. Still 10x faster and cheaper.
reply