Google DeepMind is calling for the moral behavior of large language models—such as what they do when called on to act as companions, therapists, medical advisors, and so on—to be scrutinized with the same kind of rigor as their ability to code or do math.

As LLMs improve, people are asking them to play more and more sensitive roles in their lives. Agents are starting to take actions on people’s behalf. LLMs may be able to influence human decision-making. And yet nobody knows how trustworthy this technology really is at such tasks.

With coding and math, you have clear-cut, correct answers that you can check, William Isaac, a research scientist at Google DeepMind, told me when I met him and Julia Haas, a fellow research scientist at the firm, for an exclusive preview of their work, which is published in Nature today. That’s not the case for moral questions, which typically have a range of acceptable answers: “Morality is an important capability but hard to evaluate,” says Isaac.

“In the moral domain, there’s no right and wrong,” adds Haas. “But it’s not by any means a free-for-all. There are better answers and there are worse answers.”

The researchers have identified several key challenges and suggested ways to address them. But it is more a wish list than a set of ready-made solutions. “They do a nice job of bringing together different perspectives,” says Vera Demberg, who studies LLMs at Saarland University in Germany.

Better than “The Ethicist”

A number of studies have shown that LLMs can show remarkable moral competence. One study published last year found that people in the US scored ethical advice from OpenAI’s GPT-4o as being more moral, trustworthy, thoughtful, and correct than advice given by the (human) writer of “The Ethicist,” a popular New York Times advice column.  

The problem is that it is hard to unpick whether such behaviors are a performance—mimicking a memorized response, say—or evidence that there is in fact some kind of moral reasoning taking place inside the model. In other words, is it virtue or virtue signaling?

Related work from others:  Latest from MIT : Multiple AI models help robots execute complex plans more transparently

This question matters because multiple studies also show just how untrustworthy LLMs can be. For a start, models can be too eager to please. They have been found to flip their answer to a moral question and say the exact opposite when a person disagrees or pushes back on their first response. Worse, the answers an LLM gives to a question can change in response to how it is presented or formatted. For example, researchers have found that models quizzed about political values can give different—sometimes opposite—answers depending on whether the questions offer multiple-choice answers or instruct the model to respond in its own words.

In an even more striking case, Demberg and her colleagues presented several LLMs, including versions of Meta’s Llama 3 and Mistral, with a series of moral dilemmas and asked them to pick which of two options was the better outcome. The researchers found that the models often reversed their choice when the labels for those two options were changed from “Case 1” and “Case 2” to “(A)” and “(B).”

They also showed that models changed their answers in response to other tiny formatting tweaks, including swapping the order of the options and ending the question with a colon instead of a question mark.

In short, the appearance of moral behavior in LLMs should not be taken at face value. Models must be probed to see how robust that moral behavior really is. “For people to trust the answers, you need to know how you got there,” says Haas.

More rigorous tests

What Haas, Isaac, and their colleagues at Google DeepMind propose is a new line of research to develop more rigorous techniques for evaluating moral competence in LLMs. This would include tests designed to push models to change their responses to moral questions. If a model flipped its moral position, it would show that it hadn’t engaged in robust moral reasoning. 

Related work from others:  Latest from MIT : On the road to cleaner, greener, and faster driving

Another type of test would present models with variations of common moral problems to check whether they produce a rote response or one that’s more nuanced and relevant to the actual problem that was posed. For example, asking a model to talk through the moral implications of a complex scenario in which a man donates sperm to his son so that his son can have a child of his own might produce concerns about the social impact of allowing a man to be both biological father and biological grandfather to a child. But it should not produce concerns about incest, even though the scenario has superficial parallels with that taboo.

Haas also says that getting models to provide a trace of the steps they took to produce an answer would give some insight into whether that answer was a fluke or grounded in actual evidence. Techniques such as chain-of-thought monitoring, in which researchers listen in on a kind of internal monologue that some LLMs produce as they work, could help here too.

Another approach researchers could use to determine why a model gave a particular answer is mechanistic interpretability, which can provide small glimpses inside a model as it carries out a task. Neither chain-of-thought monitoring nor mechanistic interpretability provides perfect snapshots of a model’s workings. But the Google DeepMind team believes that combining such techniques with a wide range of rigorous tests will go a long way to figuring out exactly how far to trust LLMs with certain critical or sensitive tasks.  

Different values

And yet there’s a wider problem too. Models from major companies such as Google DeepMind are used across the world by people with different values and belief systems. The answer to a simple question like “Should I order pork chops?” should differ depending on whether or not the person asking is vegetarian or Jewish, for example.

Related work from others:  AI Trends - Novelty In The Game Of Go Provides Bright Insights For AI And Autonomous Vehicles 

There’s no solution to this challenge, Haas and Isaac admit. But they think that models may need to be designed either to produce a range of acceptable answers, aiming to please everyone, or to have a kind of switch that turns different moral codes on and off depending on the user.

“It’s a complex world out there,” says Haas. “We will probably need some combination of those things, because even if you’re taking just one population, there’s going to be a range of views represented.”

“It’s a fascinating paper,” says Danica Dillion at Ohio State University, who studies how large language models handle different belief systems and was not involved in the work. “Pluralism in AI is really important, and it’s one of the biggest limitations of LLMs and moral reasoning right now,” she says. “Even though they were trained on a ginormous amount of data, that data still leans heavily Western. When you probe LLMs, they do a lot better at representing Westerners’ morality than non-Westerners’.”

But it is not yet clear how we can build models that are guaranteed to have moral competence across global cultures, says Demberg. “There are these two independent questions. One is: How should it work? And, secondly, how can it technically be achieved? And I think that both of those questions are pretty open at the moment.”

For Isaac, that makes morality a new frontier for LLMs. “I think this is equally as fascinating as math and code in terms of what it means for AI progress,” he says. “You know, advancing moral competency could also mean that we’re going to see better AI systems overall that actually align with society.”

Share via
Copy link
Powered by Social Snap