September 24, 2023

Y M L P -324

Powered by Intellect


New AI assistant can browse, search, and use web apps like a human

Still from a demo video showing ACT-1 performing a search in a browser when asked to “find me a house.”


Yesterday, California-based AI firm Adept announced Action Transformer (ACT-1), an AI model that can perform actions in software like a human assistant when given high-level written or verbal commands. It can reportedly operate web apps and perform intelligent searches on websites while clicking, scrolling, and typing in the right fields as if it were a person using the computer.

In a demo video tweeted by Adept, the company shows someone typing, “Find me a house in Houston that works for a family of 4. My budget is 600K” into a text entry box. Once the task is submitted, ACT-1 automatically navigates a website in a web browser, clicking the proper regions of the page, typing a search entry, and changing the search parameters until a matching house appears on the screen.

Another demonstration video on Adept’s website shows ACT-1 operating Salesforce with prompts such as “add Max Nye at Adept as a new lead” and “log a call with James Veel saying that he’s thinking about buying 100 widgets.” ACT-1 then clicks the right buttons, scrolls, and fills out the proper forms to finish these tasks. Other demo videos show ACT-1 navigating Google Sheets, Craigslist, and Wikipedia through a browser.

An Adept promotional video showing ACT-1 operating Google Sheets, a web-based spreadsheet app.

How is this possible? Adept describes ACT-1 as a “large-scale transformer.” In AI, a transformer model is a type of neural network that learns to do something by training on example data, and it builds knowledge of the context and relationships between items in the data set. Transformers have been behind many recent AI innovations, including language models like GPT-3 that can write at a nearly human level.

In the case of ACT-1, the training data apparently came from humans operating the software first, and the AI model learned from that. Someone who identified themselves as a developer for ACT-1 on Hacker News wrote, “We used a combination of human demonstrations and feedback data! You need custom software both to record the demonstrations and to represent the state of the tool in a model-consumable way.”
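To give a rough sense of what recording demonstrations in a "model-consumable" form could look like, here is a minimal Python sketch that stores a human demonstration as a sequence of (state, action) pairs and flattens it into text. Adept has not published its recorder or data format, so everything here is an assumption made for illustration: the `Action` and `Step` types, the field names, and the `TASK`/`STATE`/`ACTION` text layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical types for illustration only; Adept has not published its format.


@dataclass
class Action:
    kind: str        # "click", "type", or "scroll"
    target: str      # label of the UI element acted on
    text: str = ""   # text entered, used only by "type" actions


@dataclass
class Step:
    visible_elements: List[str]  # simplified snapshot of the page state
    action: Action               # what the human did in that state


def serialize_step(step: Step) -> str:
    """Flatten one (state, action) pair into a single line of text."""
    state = " | ".join(step.visible_elements)
    act = f"{step.action.kind}({step.action.target}"
    if step.action.text:
        act += f", {step.action.text!r}"
    act += ")"
    return f"STATE: {state} -> ACTION: {act}"


def serialize_demo(task: str, steps: List[Step]) -> str:
    """Join the task description and every recorded step into one example."""
    lines = [f"TASK: {task}"]
    lines.extend(serialize_step(s) for s in steps)
    return "\n".join(lines)


# A two-step demonstration: type into the search box, then click the button.
demo = [
    Step(["search box", "search button"], Action("type", "search box", "Houston")),
    Step(["search box", "search button"], Action("click", "search button")),
]
training_example = serialize_demo("Find me a house in Houston", demo)
print(training_example)
```

A real recorder would need a far richer description of the page than a list of element labels, but the shape of the data, a task followed by what the human saw and did at each step, is the part this sketch is meant to convey.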

After training, the ACT-1 model interacts with a web browser through a Chrome extension that can “observe what’s happening in the browser and take certain actions, like clicking, typing, and scrolling,” according to Adept. The company describes ACT-1’s observation ability as being able to generalize across websites, so rules learned on one site can apply to others.
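The observe-then-act cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not Adept's implementation: `FakeBrowser` is a hypothetical in-memory stand-in for the Chrome extension, `scripted_policy` is a hand-written set of rules standing in for the trained model, and the three action names are taken from Adept's description (clicking, typing, scrolling).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for the browser extension (not Adept's API)."""
    inputs: Dict[str, str] = field(default_factory=dict)
    scroll_y: int = 0
    submitted: bool = False
    log: List[str] = field(default_factory=list)

    def observe(self) -> dict:
        # A real system would expose far richer page state than this.
        return {"inputs": dict(self.inputs), "scroll_y": self.scroll_y,
                "submitted": self.submitted}

    def click(self, target: str) -> None:
        if target == "submit":
            self.submitted = True
        self.log.append(f"click {target}")

    def type_text(self, target: str, text: str) -> None:
        self.inputs[target] = text
        self.log.append(f"type {target}={text}")

    def scroll(self, amount: int) -> None:
        self.scroll_y += amount
        self.log.append(f"scroll {amount}")


def scripted_policy(obs: dict) -> Optional[tuple]:
    """Hand-written rules standing in for the trained model."""
    if "search" not in obs["inputs"]:
        return ("type", "search", "Houston")
    if obs["scroll_y"] == 0:
        return ("scroll", 400)
    if not obs["submitted"]:
        return ("click", "submit")
    return None  # nothing left to do


def run_task(browser: FakeBrowser,
             policy: Callable[[dict], Optional[tuple]],
             max_steps: int = 10) -> int:
    """Alternate between observing and acting until the policy stops."""
    for step in range(max_steps):
        action = policy(browser.observe())
        if action is None:
            return step
        kind, *args = action
        if kind == "click":
            browser.click(*args)
        elif kind == "type":
            browser.type_text(*args)
        elif kind == "scroll":
            browser.scroll(*args)
        else:
            raise ValueError(f"unknown action: {kind}")
    return max_steps


browser = FakeBrowser()
steps_taken = run_task(browser, scripted_policy)
print(steps_taken, browser.log)
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running indefinitely if the policy never signals that it is done.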

While scripts to automate browsing already exist (and are often used to power bots with ill intentions), the powerful, generalized nature of ACT-1 implied in the demos seems to take machine automation to a new level. Already, people on Twitter are both seriously and half-jokingly raising alarms over the potential for misuse that this technology could bring. Should we allow an intelligent system to have this much control over our computer interfaces?

While those concerns are purely hypothetical for now—especially since ACT-1 does not operate autonomously—they’re something to keep in mind as we rush headlong toward generalized human-level AI that can interface with the outside world through the Internet. Adept even references this goal on its website, writing, “We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer.”