The Allen Institute for Artificial Intelligence (Ai2), a research nonprofit, is releasing a family of open-source multimodal language models, called Molmo, that it says perform as well as top proprietary models from OpenAI, Google, and Anthropic. 

The organization claims that its biggest Molmo model, which has 72 billion parameters, outperforms OpenAI’s GPT-4o, which is estimated to have over a trillion parameters, in tests that measure things like understanding images, charts, and documents.  

Meanwhile, Ai2 says a smaller Molmo model, with 7 billion parameters, comes close to OpenAI’s state-of-the-art model in performance, an achievement it ascribes to vastly more efficient data collection and training methods. 

What Molmo shows is that open-source AI development is now on par with closed, proprietary models, says Ali Farhadi, the CEO of Ai2. And open-source models have a significant advantage, as their open nature means other people can build applications on top of them. The Molmo demo is available here, and it will be available for developers to tinker with on the Hugging Face website. (Certain elements of the most powerful Molmo model are still shielded from view.) 

Other large multimodal language models are trained on vast data sets containing billions of images and text samples that have been hoovered from the internet, and they can include several trillion parameters. This process introduces a lot of noise to the training data and, with it, hallucinations, says Ani Kembhavi, a senior director of research at Ai2. In contrast, Ai2’s Molmo models have been trained on a significantly smaller and more curated data set containing only 600,000 images, and they have between 1 billion and 72 billion parameters. This focus on high-quality data, versus indiscriminately scraped data, has led to good performance with far fewer resources, Kembhavi says.

Related work from others:  Latest from MIT Tech Review - The coolest thing about smart glasses is not the AR. It’s the AI.

Ai2 achieved this by getting human annotators to describe the images in the model’s training data set in excruciating detail over multiple pages of text. They asked the annotators to talk about what they saw instead of typing it. Then they used AI techniques to convert their speech into data, which made the training process much quicker while reducing the computing power required. 

These techniques could prove really useful if we want to meaningfully govern the data that we use for AI development, says Yacine Jernite, who is the machine learning and society lead at Hugging Face, and was not involved in the research. 

“It makes sense that in general, training on higher-quality data can lower the compute costs,” says Percy Liang, the director of the Stanford Center for Research on Foundation Models, who also did not participate in the research. 

Another impressive capability is that the model can “point” at things, meaning it can analyze elements of an image by identifying the pixels that answer queries.

In a demo shared with MIT Technology Review, Ai2 researchers took a photo outside their office of the local Seattle marina and asked the model to identify various elements of the image, such as deck chairs. The model successfully described what the image contained, counted the deck chairs, and accurately pinpointed to other things in the image as the researchers asked. It was not perfect, however. It could not locate a specific parking lot, for example. 

Other advanced AI models are good at describing scenes and images, says Farhadi. But that’s not enough when you want to build more sophisticated web agents that can interact with the world and can, for example, book a flight. Pointing allows people to interact with user interfaces, he says. 

Related work from others:  Latest from MIT Tech Review - These new tools could help protect our pictures from AI

Jernite says Ai2 is operating with a greater degree of openness than we’ve seen from other AI companies. And while Molmo is a good start, he says, its real significance will lie in the applications developers build on top of it, and the ways people improve it.

Farhadi agrees. AI companies have drawn massive, multitrillion-dollar investments over the past few years. But in the past few months, investors have expressed skepticism about whether that investment will bring returns. Big, expensive proprietary models won’t do that, he argues, but open-source ones can. He says the work shows that open-source AI can also be built in a way that makes efficient use of money and time. 

“We’re excited about enabling others and seeing what others would build with this,” Farhadi says. 

Share via
Copy link
Powered by Social Snap