Show HN: Magnitude – Open-source AI browser automation framework

(github.com)

106 points

by anerli 18 hours ago

by rozap 14 hours ago

There are a number of these out there, and this one has a super easy setup and appears to Just Work, so nice job on that. I had it going and producing plausible results within a minute or so.

One thing I'm wondering is if there's anyone doing this at scale? The issue I see is that with complex workflows which take several dozen steps and have complex control flow, the probability of reaching the end falls off pretty hard, because if each step has a .95 chance of completing successfully, after not very many steps you have a pretty small overall probability of success. These use cases are high value because writing a traditional scraper is a huge pain, but we just don't seem to be there yet.
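To put rough numbers on that compounding effect, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification; the function names are made up for illustration.

```python
import math


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step ** n_steps


def max_steps_for(p_step: float, target: float) -> int:
    """Largest step count whose end-to-end success probability is still >= target."""
    return math.floor(math.log(target) / math.log(p_step))


for n in (5, 10, 20, 40):
    print(f"{n:>2} steps at 0.95 each -> {chain_success(0.95, n):.1%} end-to-end")

print(max_steps_for(0.95, 0.5))  # 13
```

Under those assumptions a 20-step workflow finishes only about 36% of the time, and anything longer than 13 steps is more likely to fail than to succeed.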

The other side of the coin is simple workflows, but those tend to be the workflows where writing a scraper is pretty trivial. This did work, and I told it to search for a product at a local store, but the program cost $1.05 to run. So doing it at any scale quickly becomes a little bit silly.

So I guess my question is: who is having luck using these tools, and what are you using them for?

One route I had some success with is writing a DSL for scraping and then having the llm generate that code, then interpreting it and editing it when it gets stuck. But then there's the "getting stuck detection" part which is hard etc etc.
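As a rough illustration of that route (not the actual DSL described here, which isn't shown), the sketch below interprets a tiny hypothetical scraping DSL and raises a `Stuck` error that names the failing step, which is the piece an LLM could then be asked to edit. The page is modelled as a plain dict from selector to text so the example runs without a browser; the ops `expect` and `extract` are invented for the example.

```python
class Stuck(Exception):
    """Raised when a step cannot make progress; records which step failed."""

    def __init__(self, index: int, reason: str):
        super().__init__(f"step {index}: {reason}")
        self.index = index
        self.reason = reason


def run(program, page):
    """Interpret a list of (op, *args) tuples against a page (selector -> text)."""
    extracted = {}
    for index, (op, *args) in enumerate(program):
        if op == "expect":
            (selector,) = args
            if selector not in page:
                raise Stuck(index, f"missing {selector!r}")
        elif op == "extract":
            field, selector = args
            if selector not in page:
                raise Stuck(index, f"missing {selector!r}")
            extracted[field] = page[selector].strip()
        else:
            raise Stuck(index, f"unknown op {op!r}")
    return extracted


program = [("expect", "#results"), ("extract", "price", ".price")]
print(run(program, {"#results": "1 item", ".price": " $4.99 "}))  # {'price': '$4.99'}

try:
    run(program, {"#results": "1 item"})  # the .price element has disappeared
except Stuck as stuck:
    print(stuck.index, stuck.reason)  # 1 missing '.price'
```

This only catches the easy case, where a selector is plainly missing. Detecting that a program is stuck while every selector still resolves is the hard part and is not addressed here.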

by anerli 14 hours ago

Glad you were able to get it set up quickly!

We are currently optimizing for reliability and quality, which is why we suggest Claude - but it can get expensive in some cases. Using Qwen 2.5-VL-72B will be significantly cheaper, though it may not always be reliable.

Most of our usage right now is for running test cases, and people often seem to prefer Qwen for that use case - since it is typically clearer how to execute a test case.

Something that is top of mind for us is figuring out a good way to "cache" workflows that get taken. This way you can repeat automations either with no LLM or with a smaller/cheaper LLM. This would enable deterministic, repeatable flows that are also very affordable and fast. So even if each step on the first run is only 95% reliable - if it gets through it, it could repeat it with 100% reliability.
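A minimal sketch of what such a cache could look like, under loud assumptions: this is not Magnitude's implementation, the class and callback names are hypothetical, actions are plain dicts, and `execute` is assumed to return `True` when a step visibly succeeded. It also assumes a task can simply be restarted after a failed replay, which real pages often do not allow.

```python
from typing import Callable, Dict, List

Action = dict  # hypothetical shape, e.g. {"type": "click", "x": 120, "y": 48}


class WorkflowCache:
    """Replay recorded actions when possible; call the LLM planner only on a miss."""

    def __init__(self,
                 plan_with_llm: Callable[[str], List[Action]],
                 execute: Callable[[Action], bool]):
        self.plan_with_llm = plan_with_llm
        self.execute = execute
        self.store: Dict[str, List[Action]] = {}
        self.llm_calls = 0

    def run(self, task: str) -> str:
        cached = self.store.get(task)
        if cached is not None:
            # all() stops at the first failing step, so nothing runs past a break.
            if all(self.execute(action) for action in cached):
                return "replayed"
            del self.store[task]  # stale recording: drop it and re-plan below
        self.llm_calls += 1
        actions = self.plan_with_llm(task)
        if not all(self.execute(action) for action in actions):
            raise RuntimeError(f"could not complete task: {task!r}")
        self.store[task] = actions
        return "planned"


# Stub planner and executor so the sketch runs without a browser or a model.
cache = WorkflowCache(
    plan_with_llm=lambda task: [{"type": "click", "target": "search box"}],
    execute=lambda action: True,
)
print(cache.run("search for milk"))  # planned
print(cache.run("search for milk"))  # replayed
print(cache.llm_calls)               # 1
```

The second run costs no LLM call at all, which is where the affordability comes from. Whether replay is truly "100% reliable" depends on the page not changing between runs.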

by TheTaytay 11 hours ago

I am desperately waiting for someone to write exactly this! Use the LLM to write the repeatable, robust script. If the script fails, THEN fall back to an LLM to recover and fix the script.

by pzo 10 hours ago

Yes, I wish we could combine browser use, stagehand, director.ai, and playwright. Even better if I could record my session with mouse movements, clicks, DOM inspection, screen sharing, and my voice explaining what I want to do. Then the LLM would generate a scraper for each task and recover if a scraping task broke at some point.

by mertunsall 5 hours ago

https://github.com/browser-use/workflow-use

by anerli 11 hours ago

Yeah, I think it's a little tricky to do this well + automatically, but it is essentially our goal - not necessarily literally writing a script but storing the actions taken by the LLM and being able to repeat them, and adapt only when needed.

by mertunsall 5 hours ago

In browser-use, we combine vision + browser extraction and we find that this gives the most reliable agent: https://github.com/browser-use/browser-use :)

We recently gave the model access to a file system so that it never forgets what it's supposed to do - we already have a ton of users very happy with recent reliability updates!

We also have a beta workflow-use, which is basically what's mentioned in the comments here to "cache" a workflow: https://github.com/browser-use/workflow-use

Let us know what you guys think - we are shipping hard and fast!

by dataviz1000 10 hours ago

Hey guys, I got a question.

I've been working on a Chrome extension with a side panel. Think about it like the side panel copilot in VSCode, Cursor, or Windsurf. Currently it is automating workflows, but those are hard coded. I've started working on a more generalized automation using langchain. Looking at your code is helpful because I can, in only a few hundred lines of code, recreate a huge portion of Playwright's capabilities in a Chrome extension side panel, so I should be able to port it to the Chrome extension. That is, I'm creating tools like mouse click, type, mouse move, open tab, navigate, wait for element, etc.

Looking at your code, I'm thinking about pulling anything that isn't coupled to node while mapping all the Playwright capabilities to the equivalent in a Chrome extension. It's busy work.

If I do that, why would I prefer using .baml over the equivalent langchain? What's the difference? Am I comparing apples to oranges? I'm not worried about using langgraph because I should be able to get most of the functionality with xstate v5 [0] plus serialized portable JSON state graphs, so I can store custom graphs on a remote server that can be queried by API.

That is my question. I don't see langchain in the dependencies, which is cool, but why .baml? Also, what am I missing going down this thought path?

[0] https://chatgpt.com/share/685dfc60-106c-8004-bbd0-1ba3a33aba...

by anerli 10 hours ago

Hey, curious about your use cases for a chrome extension, care to share more?

To answer your question - BAML is a DSL that helps to define prompts, organize context, and get better performance on structured output from the LLM. In theory you should be able to map over similar logic to other clients.

by pzo 10 hours ago

A Chrome extension has the advantage of user-friendly distribution, so that non-tech-savvy users can also do automation. I'm also looking for automation for mobile devices (app webview or mobile Safari), and because of platform limitations it doesn't seem this can be extended to mobile devices anytime soon.

by dataviz1000 7 hours ago

In 2018, I helped the NFL front offices and the ticket brokers who bought wholesale in blocks of 10k manage event tickets, hundreds of thousands of them, across secondary marketplaces such as StubHub and SeatGeek. Their primary marketplace, Ticketmaster, was very slow to develop an API that would let them import the data (barcodes) into the secondary markets and delist a ticket once it was in a secondary-market checkout or sold, preventing millions of dollars' worth of double-sold tickets. The problem was that Ticketmaster, for legal reasons, couldn't give us preferential access, so I was always updating whenever they changed their anti-bot protections. I created a side-loaded Chrome extension as a backup in case they blocked the automated browsers on a Friday night; it did everything the Puppeteer agents were doing, to buy me time. It was a perfect stopgap. The users would press a button and watch it automatically navigate to pages and handle their workflow in their browser, moving lightning fast.

You can do most anything you can do in Playwright (navigate, open new tabs, scroll) with the added benefit of keeping the human in the loop. Conceptually they are exactly the same; I can go into that more if you want. Most of the limitations are security features. However, for automated workflows, the security features should be heeded for good reason. For example, the ChatGPT console requires isTrusted to be true and rejects synthetic events, so it is impossible to automate the ChatGPT console without workarounds, which they will likely close. That is the biggest limitation. On the other hand, there are 3 billion Chrome users and they can download the extension with a single click. Bypassing security features, like requiring a human interaction (button press or mouse click) to go fullscreen, play sound, or transfer money on a bank website, shouldn't be allowed. If the use case requires that, use Playwright or a BrowserWindow in an Electron application. A Chrome extension with a side panel can collect every visible element using stacking context to limit the amount of data processed by an LLM; it can capture all the inner text of a page; it can read every single fetch and XMLHttpRequest, which is a very good way to get data without loading tons of markup; it can make fetch and XMLHttpRequest calls in the MAIN world so they automatically contain all the cookies; and it can use huggingface/transformers.js to transcribe audio or video to text with OpenAI Whisper, or perform OCR (image to text) on WebGPU, if available.

I can systematically analyze, poke, and prod thousands of websites running with playwright in the cloud to discover all the capabilities and automatically create workflows with xstate v5 which are sent to the Chrome extension in JSON. For example, I can automatically navigate to a website, find all the inputs, try several ways to inject text, use image to text to test if the text is added to the field to add to the list of capabilities. So if a user is on the page, I can automate the workflow or notify the user they need to take a step.

I think the best idea is to have curated workflows and curated data embeddings to target focused industries. It can automate navigating the browser to MLS and zillow.com, collect information, inject it into Google Sheets or Office 365 Excel, export it, navigate to email, write information, attach the file to the email, and send it, all with the human in the loop. Moreover, if it does 95% of the work, I don't think humans will mind pressing a button or taking an action when prompted. The question is: will people prefer this over full automation running somewhere in the cloud? How do you feel about using a code assistant? Do you like being in the loop?

This is all experimental. The gif has a good example of a side panel automating stock option trading. I'm going to try and inject your code to see if I can start to develop systematic generalized automation with it. [0] [1]

[0] https://github.com/adam-s/doomberg-terminal

[1] https://github.com/adam-s/doomberg-terminal/tree/main/docs/m...

by ewired 12 hours ago

It was interesting to find out that Qwen 2.5 VL can output coordinates like Sonnet 4, or does that use a different implementation?

by anerli 12 hours ago

Both of them are "visually grounded" - meaning if you ask for the location of something in an image - they can output the exact x/y pixel coordinates! Not many models can do this, especially not many that are large enough to actually reason through sequences of actions well

by grbsh 17 hours ago

Why not just use Claude by itself? Opus and Sonnet are great at producing pixel coordinates and tool usages from screenshots of UIs. Curious as to what your framework gives me over the plain base model.

by anerli 17 hours ago

Hey! To have a framework that can effectively control browser agents, you need systems to interact with the browser, but also pass relevant content from the page to the LLM. Our framework manages this agent loop in a way that enables flexible agentic execution that can mix with your own code - giving you control but in a convenient way. Claude and OpenAI computer use APIs/loops are slower, more expensive, and tailored for a limited set of desktop automation use cases rather than robust browser automations.

by axlee 14 hours ago

Using this for testing instead of regular playwright must 10000x the cost and run time, doesn't it? At what point do the benefits outweigh the costs?

by anerli 14 hours ago

I think it depends a lot on how much you value your own time, since it's quite time consuming to write and update playwright scripts. It's gonna save you developer hours to write automations using natural language rather than messing around with and fixing selectors. It's also able to handle tasks that playwright wouldn't be able to do at all - like extracting structured data from a messy/ambiguous DOM and adapting automatically to changing situations.

You can also use cheaper models depending on your needs, for example Qwen 2.5 VL 72B is pretty affordable and works pretty well for most situations.

by plufz 14 hours ago

But we can use an LLM to write that script and give that agent access to a browser to find DOM selectors etc. And then we have a stable script where we, if needed, can manually fix any LLM bugs just once…? I’m sure there are use cases with messy selectors as you say, but for me it feels like most cases are better covered by generating scripts.

by anerli 13 hours ago

Yeah, we've thought about this approach a lot - but the problem is if your final program is a brittle script, you're gonna need a way to fix it again often - and then you're still depending on recurrently using LLMs/agents. So we think it's better to have the program itself be resilient to change instead of you/your LLM assistant having to constantly ensure the program is working.

by adenta 11 hours ago

I wonder if a nice middle ground would be:

- recording the playwright behind the scenes and storing it
- trying that as a “happy path” first attempt to see if it passes
- if it doesn’t pass, rebuilding it with the AI and vision models

Best of both worlds. The playwright is more of a cache than a test

by anerli 10 hours ago

I think the difficulty with this approach is (1) you want a good "lookup" mechanism - given a task, how do you know what cache should be loaded? You can do a simple string lookup based on the task content, but when the task might include parameters or data, or be a part of a bigger workflow, it gets trickier. (2) You need a good way to detect when to adapt / fall back to the LLM. When the cache is only a playwright script, it can be difficult to know when it falls out of the existing trajectory. You can check for selector timeouts and things, but that can leave a lot of false negatives, where the script has gone off course without tripping any check.
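One way to picture both problems, as a sketch under stated assumptions rather than a description of how Magnitude works: key the cache on a task template instead of the filled-in string, so different parameter values share one recording, and record simple postconditions (URL, heading text, and so on) on the first run to compare against during replay. All names and the postcondition format here are hypothetical.

```python
import hashlib
import re


def cache_key(template: str) -> str:
    """Key on the task template, not the filled-in task, so parameters don't fragment the cache."""
    normalized = re.sub(r"\s+", " ", template.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def fill(template: str, params: dict) -> str:
    """Produce the concrete task text for one run."""
    return template.format(**params)


def drifted(recorded: dict, observed: dict) -> bool:
    """True if any postcondition recorded on the first run no longer holds on replay."""
    return any(observed.get(name) != value for name, value in recorded.items())


template = "Search for {product} at {store}"
print(fill(template, {"product": "oat milk", "store": "Main St"}))
# Search for oat milk at Main St

# Whitespace and case differences map to the same cache entry.
print(cache_key(template) == cache_key("  search for {product}   at {store} "))  # True

print(drifted({"url": "/results"}, {"url": "/results", "title": "Results"}))  # False
print(drifted({"url": "/results"}, {"url": "/login"}))                        # True
```

This catches a replay that lands on the wrong page even though no selector timed out, but only for conditions someone thought to record. It does nothing for the harder case of a task that is part of a bigger workflow.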

by lyime 7 hours ago

Are you sure? Couldn't you just go back to the LLM if the script breaks? Pages change, but not that often in general.

It seems like a hybrid approach would scale better and be significantly cheaper.
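A back-of-the-envelope sketch of that cost argument. Only the $1.05 figure comes from this thread (the simple product search reported upthread); the per-run script cost and the 2% break rate are invented purely for illustration.

```python
def hybrid_cost_per_run(llm_cost: float, script_cost: float, break_rate: float) -> float:
    """Expected cost when the script runs every time and the LLM only on broken runs."""
    return script_cost + break_rate * llm_cost


LLM_COST = 1.05      # dollars per LLM-driven run, as reported upthread
SCRIPT_COST = 0.001  # assumed compute cost of one scripted run
BREAK_RATE = 0.02    # assumed fraction of runs where the page changed

per_run = hybrid_cost_per_run(LLM_COST, SCRIPT_COST, BREAK_RATE)
print(f"hybrid:   ${per_run:.3f} per run, ${per_run * 1000:.2f} per 1000 runs")
print(f"LLM only: ${LLM_COST:.3f} per run, ${LLM_COST * 1000:.2f} per 1000 runs")
```

With these made-up rates the hybrid comes out near $22 per thousand runs against $1,050 for the LLM on every run. The gap shrinks as pages break more often, so the break rate is the number worth measuring.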

by anerli 6 hours ago

We do believe in a hybrid approach where a fast/deterministic representation is saved - but think there is a more seamless way where the framework itself is high level and manages these details by caching the underlying actions that can run.

by tnolet 6 hours ago

I think you are overstating. Just use Playwright codegen. No need for manual test writing, or at least 90% can get generated. Still 10x faster and cheaper.

by sylware 3 hours ago

Wow, I guess this could be significant for the humans of click/view/account creation farms.

by mountainriver 10 hours ago

How many of these are there now?

by anerli 10 hours ago

Only one that's worth using ;)

by 10yearsalurker 5 hours ago

Pop Pop! (Sorry, I just couldn’t resist)

by jachee 4 hours ago

Someone had to. ;)

by KeysToHeaven 17 hours ago

Finally, a browser agent that doesn’t panic at the sight of a canvas

by anerli 17 hours ago

Exactly :)

by revskill 16 hours ago

Not sure about this because you're the author.

by TheTaytay 11 hours ago

It's obvious this is the OP though. They are allowed to respond to favorable comments.

by anerli 16 hours ago

Try it out and report back!

by revskill 15 hours ago

No

by legucy 15 hours ago

Classic new age hacker news hostility. Do you think this response adds anything?

by owebmaster 12 hours ago

I do, cheap praise doesn't benefit the community and it might be astroturf. Constructive criticism would be more valuable - there are multiple similar projects like this posted here daily, and this one likely isn't the best.

by anerli 11 hours ago

For context, we have no affiliation with KeysToHeaven (though we appreciate his comment). We do think our vision-first approach gives us a significant edge over other browser agents, though we probably could’ve made that aspect clearer in the title
