MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

When ChatGPT was first released, everyone in AI was talking about the new generation of AI assistants. But over the past year, that excitement has turned to a new target: AI agents. 

Agents featured prominently in Google’s annual I/O conference in May, when the company unveiled its new AI agent called Astra, which allows users to interact with it using audio and video. OpenAI’s new GPT-4o model has also been called an AI agent.  

And it’s not just hype, although there is definitely some of that too. Tech companies are plowing vast sums into creating AI agents, and their research efforts could usher in the kind of useful AI we have been dreaming about for decades. Many experts, including Sam Altman, say they are the next big thing.   

But what are they? And how can we use them? 

How are they defined? 

It is still early days for research into AI agents, and the field has not settled on a definition for them. But put simply, they are AI models and algorithms that can autonomously make decisions in a dynamic world, says Jim Fan, a senior research scientist at NVIDIA who leads the company’s AI agents initiative. 

The grand vision for AI agents is a system that can execute a vast range of tasks, much like a human assistant can. In the future, it could help you book your vacation, but it will also remember if you prefer swanky hotels, so it will only suggest hotels that have four stars or more, then go ahead and book the one you pick from the range of options it offers you. It will then also suggest flights that work best with your calendar, and plan the itinerary for your trip based on your preferences. It could make a list of things to pack based on that plan and the weather forecast. It might even send your itinerary to any friends it knows live in your destination, and invite them along. In the workplace, it could analyze your to-do list and execute tasks from it, such as sending calendar invites, memos or emails. 

One vision for agents is that they are multimodal, meaning they can process language, audio and video. For example, in Google’s Astra demo, users could point their smartphone cameras at things and ask the agent questions. The agent could respond to inputs across text, audio and video. 

These agents could also make processes smoother for businesses and public organizations, says David Barber, the director of the University College London Centre for Artificial Intelligence. For example, an AI agent might be able to function as a more sophisticated customer service bot. The current generation of language model-based assistants can only generate the next likely word in a sentence. But an AI agent would have the ability to act on natural language commands autonomously, and process customer service tasks without supervision. For example, the agent will be able to analyze customer complaint emails, and then know it needs to check the customer’s reference number, access databases such as customer relationship management and delivery systems to see whether the complaint is legitimate, and process it according to the company’s policies, Barber says. 
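The complaint-handling steps Barber describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CRM` and `DELIVERIES` dictionaries, the `REF-` number format, the action names and the refund policy are all hypothetical, and a real agent would use a language model to read the email rather than a pattern match.

```python
import re

# Hypothetical stand-ins for a CRM database and a delivery system.
CRM = {"REF-1042": {"customer": "A. Lee", "order": "ORD-77"}}
DELIVERIES = {"ORD-77": {"status": "lost"}}


def handle_complaint(email_text):
    """Return the action to take for one complaint email."""
    # Step 1: find the customer's reference number in the email.
    match = re.search(r"REF-\d+", email_text)
    if match is None:
        return "ask_for_reference"

    # Step 2: look the customer up in the (hypothetical) CRM.
    record = CRM.get(match.group())
    if record is None:
        return "escalate_to_human"

    # Step 3: check the delivery system to see whether the complaint holds up.
    delivery = DELIVERIES.get(record["order"], {})

    # Step 4: apply a made-up company policy: refund confirmed problems only.
    if delivery.get("status") in {"lost", "damaged"}:
        return "issue_refund"
    return "escalate_to_human"


print(handle_complaint("My parcel never arrived. Reference REF-1042."))
```

The point of the sketch is the chain of lookups and decisions, which is what separates an agent from an assistant that only produces text. Here every step is hard-coded, whereas the agents described in this article would choose those steps themselves.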

Broadly speaking, there are two different categories of agents: software agents and embodied agents, says Fan. 

Software agents run on computers or mobile phones and use apps, much like the travel agent example above. “Those agents are very useful for office work or sending emails or having this chain of events going on,” he says. 

Embodied agents are situated in a 3D world, such as a video game, or in a robot. These kinds of agents might make video games more engaging, as people can play with non-player characters that are controlled by AI. They could also help build more useful robots that could help us with everyday tasks at home, such as folding laundry and cooking meals. 

Fan was part of a team that built an embodied AI agent called MineDojo in the popular computer game Minecraft. Using a vast trove of data collected from the internet, Fan’s AI agent was able to learn new skills and tasks that allowed it to freely explore the virtual 3D world, and complete complex tasks such as encircling llamas with fences or scooping lava into a bucket. Video games are good proxies for the real world, as they require agents to understand physics, reasoning and common sense. 

In a new paper, which has not yet been peer-reviewed, researchers at Princeton say that AI agents tend to have three different characteristics. AI systems are considered “agentic” if they can pursue difficult goals in complex environments without being instructed. They also qualify if they can be instructed in natural language and act autonomously without supervision. And finally, the term “agent” can also apply to systems that are able to use tools, such as web search or programming, or are capable of planning. 

Are they a new thing?

The term ‘AI agents’ has been around for years, and has meant different things at different times, says Chirag Shah, a computer science professor at the University of Washington. 

There have been two waves of agents, says Fan. The current wave is thanks to the language model boom and the rise of systems such as ChatGPT. 

The previous wave was in 2016 when Google DeepMind introduced AlphaGo, its AI system that can play—and win—the game Go. AlphaGo was able to make decisions and plan strategies. This relied on reinforcement learning, a technique that rewards AI algorithms for desirable behaviors. 

“But these agents were not general,” says Oriol Vinyals, vice president of research at Google DeepMind. They were created for very specific tasks—in this case playing the game Go. The new generation of foundation model-based AI makes agents more universal, as they can learn from the world humans interact with. 

“You feel much more that the model is interacting with the world and then giving back to you better answers or better assisted assistance or whatnot,” says Vinyals. 

What are the limitations? 

There are still many open questions that need to be answered. Kanjun Qiu, the CEO and founder of AI startup Imbue, which is working on agents that can reason and code, likens the state of agents to where self-driving cars were just over a decade ago. They can do stuff, but they’re unreliable and still not really autonomous. For example, a coding agent can generate code, but it sometimes gets it wrong, and doesn’t know how to test the code it’s creating, says Qiu. So humans still need to be actively involved in the process. AI systems still can’t fully reason, which is a critical step in operating in a complex and ambiguous human world. 

“We’re nowhere close to having an agent that can just automate all of these chores for us,” says Fan. Current systems “hallucinate and they also don’t always follow instructions closely. And that becomes annoying,” Fan says. 

Another limitation is that, after a while, AI agents lose track of what they are working on. AI systems are limited by their context windows, meaning the amount of data they can take into account at any given time. 
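A rough sketch shows why a fixed context window makes an agent lose track of its work. The `build_context` function, the word-count "tokenizer" and the 12-token budget are assumptions made for illustration, not how any particular product manages its context.

```python
from collections import deque


def build_context(messages, max_tokens):
    """Keep only the most recent messages that fit within max_tokens."""
    kept = deque()
    used = 0
    # Walk backwards from the newest message until the budget runs out.
    for message in reversed(messages):
        cost = len(message.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        kept.appendleft(message)
        used += cost
    return list(kept)


history = [
    "Task: refactor the billing module",
    "Step 1 done: renamed the invoice class",
    "Step 2 done: updated the tests",
    "Step 3 in progress: fixing imports",
]

context = build_context(history, max_tokens=12)
print(context)
```

With this budget only the last two messages survive, so the original task description is no longer visible to the model. That is the kind of forgetting described above.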

“ChatGPT can do coding, but it’s not able to do long-form content well. But for human developers, we look at an entire GitHub repository that has tens, if not hundreds, of lines of code, and we have no trouble navigating it,” says Fan. 

To tackle this problem, Google has increased its models’ capacity to process data, which allows users to have longer interactions with them, and have models remember more about past interactions. The company said it is working on making its context windows infinite in the future.

For embodied agents, such as robots, there are even more limitations. There is a lack of training data to teach them, and researchers are only just starting to harness the power of foundation models in robotics. 

So, amid all the hype and excitement, it’s worth bearing in mind that research into AI agents is still in its very early stages, and it will likely take years until we can experience their full potential. 

That sounds cool. But can I try an AI agent now anyway? 

Sort of. You’ve most likely tried their early prototypes, such as OpenAI’s ChatGPT and GPT-4. “If you’re interacting with software that feels smart, that is kind of an agent,” says Qiu. 

Right now the best agents we have are systems that have very narrow and specific use cases, such as coding, customer service bots or workflow automation software like Zapier, she says. But these are a far cry from a universal AI agent that can do complex tasks. 

“Today we have these computers and they’re really powerful, but we have to micromanage them,” says Qiu. 

OpenAI’s ChatGPT plug-ins, which allow people to create AI-powered assistants for web browsers, were an attempt at agents, says Qiu. But these systems are still clumsy, unreliable, and not capable of reasoning, she says. 

Despite that, these systems will one day change the way we interact with technology, Qiu believes, and it is a trend people need to pay attention to. 

“It’s not like, ‘Oh my God, all of a sudden we have AGI’… but more like, ‘Oh my God, my computer can do way more than it did five years ago,’” she says.
