Yesterday, California-based AI firm Adept announced Action Transformer (ACT-1), an AI model that can perform actions in software like a human assistant when given high-level written or verbal commands. It can reportedly operate web apps and perform intelligent searches on websites while clicking, scrolling, and typing in the right fields as if it were a person using the computer.
In a demo video tweeted by Adept, the company shows someone typing, “Find me a house in Houston that works for a family of 4. My budget is 600K” into a text entry box. Upon submitting the task, ACT-1 automatically browses Redfin.com in a web browser, clicking the proper regions of the website, typing a search entry, and changing the search parameters until a matching house appears on the screen.
“We built a new model! It’s called Action Transformer (ACT-1) and we taught it to use a bunch of software tools. In this first video, the user simply types a high-level request and ACT-1 does the rest.”
Adept (@AdeptAILabs), September 14, 2022
Another demonstration video on Adept’s website shows ACT-1 operating Salesforce with prompts such as “add Max Nye at Adept as a new lead” and “log a call with James Veel saying that he’s thinking about buying 100 widgets.” ACT-1 then clicks the right buttons, scrolls, and fills out the proper forms to finish these tasks. Other demo videos show ACT-1 navigating Google Sheets, Craigslist, and Wikipedia through a browser.
How is this possible? Adept describes ACT-1 as a “large-scale transformer.” In AI, a transformer model is a type of neural network that learns to do something by training on example data, and it builds knowledge of the context and relationships between items in the data set. Transformers have been behind many recent AI innovations, including language models like GPT-3 that can write at a nearly human level.
In the case of ACT-1, the training data apparently came from recordings of humans operating the software, which the AI model then learned from. Someone who identified themselves as a developer for ACT-1 on Hacker News wrote, “We used a combination of human demonstrations and feedback data! You need custom software both to record the demonstrations and to represent the state of the tool in a model-consumable way.”
After training, the ACT-1 model interacts with a web browser through a Chrome extension that can “observe what’s happening in the browser and take certain actions, like clicking, typing, and scrolling,” according to Adept. The company describes ACT-1’s observation ability as being able to generalize across websites, so rules learned on one site can apply to others.
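To make the observe-then-act idea concrete, here is a minimal Python sketch of such a loop. It is purely illustrative and is not Adept's code: `FakeBrowser`, `Action`, `toy_policy`, and `run_episode` are hypothetical names, the "browser" is an in-memory stand-in, and a hand-written rule takes the place of the trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str         # "click", "type", or "scroll"
    target: str = ""  # hypothetical element identifier
    text: str = ""    # text to enter for "type" actions


@dataclass
class FakeBrowser:
    """Toy in-memory stand-in for a real browser (an assumption)."""
    fields: dict = field(default_factory=lambda: {"search": ""})
    clicked: list = field(default_factory=list)
    scroll_y: int = 0

    def observe(self) -> dict:
        # Return a snapshot of the current state for the policy to read.
        return {
            "fields": dict(self.fields),
            "clicked": list(self.clicked),
            "scroll_y": self.scroll_y,
        }

    def apply(self, action: Action) -> None:
        # Carry out one of the three action types named in the article.
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click":
            self.clicked.append(action.target)
        elif action.kind == "scroll":
            self.scroll_y += 100
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def toy_policy(goal: str, observation: dict) -> Optional[Action]:
    # Hand-written rule standing in for a trained model:
    # fill in the search box, click submit, then stop.
    if observation["fields"]["search"] != goal:
        return Action("type", "search", goal)
    if "submit" not in observation["clicked"]:
        return Action("click", "submit")
    return None


def run_episode(
    goal: str,
    browser: FakeBrowser,
    policy: Callable[[str, dict], Optional[Action]],
    max_steps: int = 10,
) -> list:
    # Observe, choose an action, apply it; repeat until the policy stops.
    history = []
    for _ in range(max_steps):
        action = policy(goal, browser.observe())
        if action is None:
            break
        browser.apply(action)
        history.append(action)
    return history


if __name__ == "__main__":
    browser = FakeBrowser()
    for step in run_episode("houses in Houston", browser, toy_policy):
        print(step)
```

In a real system, the policy would be a learned model and the observation would be some machine-readable representation of the page, but the basic cycle of observing state and emitting clicks, keystrokes, and scrolls would look broadly similar.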
While scripts to automate browsing already exist (and are often used to power bots with ill intentions), the powerful, generalized nature of ACT-1 implied in the demos seems to take machine automation to a new level. Already, people on Twitter are both seriously and half-jokingly raising alarms over the potential for misuse that this technology could bring. Should we allow an intelligent system to have this much control over our computer interfaces?
While those concerns are purely hypothetical for now—especially since ACT-1 does not operate autonomously—they’re something to keep in mind as we rush headlong toward generalized human-level AI that can interface with the outside world through the Internet. Adept even references this goal on its website, writing, “We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer.”