Latest from MIT Tech Review – Multimodal: AI’s new frontier

Multimodality is a relatively new term for something extremely old: how people have learned about the world since humanity appeared. Individuals receive information from myriad sources via their senses, including sight, sound, and touch. Human brains combine these different modes of data into a highly nuanced, holistic picture of reality.

“Communication between humans is multimodal,” says Jina AI CEO Han Xiao. “They use text, voice, emotions, expressions, and sometimes photos.” Those are just a few obvious means of sharing information. Given this, he adds, “it is very safe to assume that future communication between human and machine will also be multimodal.”

A technology that sees the world from different angles

We are not there yet. The furthest advances in this direction have occurred in the fledgling field of multimodal AI. The problem is not a lack of vision. While a technology able to translate between modalities would clearly be valuable, Mirella Lapata, a professor at the University of Edinburgh and director of its Laboratory for Integrated Artificial Intelligence, says “it’s a lot more complicated” to execute than unimodal AI.


In practice, generative AI tools use different strategies for different types of data when building large data models, the complex neural networks that organize vast amounts of information. For example, those that draw on textual sources break text into individual tokens, usually words or word fragments. Each token is assigned an “embedding” or “vector”: a list of numbers representing how and where the token is used compared to others. Collectively, these numbers form a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
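To make the idea of an embedding concrete, here is a minimal Python sketch. It is an illustration only: the three-sentence corpus and the helper names (`tokenize`, `embed`, `cosine`) are made up for this example, and it simply counts which words appear alongside each token. Production models learn their embeddings during training rather than counting, but the principle that a token's vector reflects how it is used relative to other tokens is the same.

```python
import math
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "the oak tree has green leaves",
    "the pine tree has green needles",
    "the car has four wheels",
]

def tokenize(sentence):
    """Split a sentence into word tokens (real tokenizers are more elaborate)."""
    return sentence.lower().split()

# Fixed vocabulary order, so every vector has the same layout.
vocab = sorted({tok for s in corpus for tok in tokenize(s)})

def embed(token):
    """Toy embedding: how often each vocabulary word shares a sentence with `token`."""
    counts = Counter()
    for sentence in corpus:
        tokens = tokenize(sentence)
        if token in tokens:
            counts.update(t for t in tokens if t != token)
    return [counts[word] for word in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

oak_pine = cosine(embed("oak"), embed("pine"))
oak_car = cosine(embed("oak"), embed("car"))
print(f"oak vs pine: {oak_pine:.2f}, oak vs car: {oak_car:.2f}")
```

In this toy corpus, “oak” and “pine” are used in similar contexts, so their vectors sit closer together than those of “oak” and “car”.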

A multimodal AI model typically relies on several unimodal ones. As Henry Ajder, founder of AI consultancy Latent Space, puts it, this involves “almost stringing together” the various contributing models. Doing so involves various techniques to align the elements of each unimodal model, in a process called fusion. For example, the word “tree”, an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality.
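One of the simplest ways to picture fusion is to concatenate the vectors that each unimodal model produces. The Python sketch below does that for the “tree” example. It rests on stated assumptions: the three input vectors are hypothetical stand-ins for the outputs of text, image, and audio models, and normalize-then-concatenate is only one basic technique among many, not necessarily the approach any particular system uses.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so no single modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def fuse(*modal_vectors):
    """Concatenation fusion: normalize each modality's vector, then join them."""
    fused = []
    for vec in modal_vectors:
        fused.extend(l2_normalize(vec))
    return fused

# Hypothetical unimodal embeddings of one concept (values are made up).
text_vec = [0.2, 0.9, 0.1]          # the word "tree"
image_vec = [12.0, 3.0, 7.0, 1.0]   # a photo of an oak tree
audio_vec = [0.5, 0.5]              # a recording of rustling leaves

tree = fuse(text_vec, image_vec, audio_vec)
print(len(tree))  # one vector covering all three modalities
```

The result is a single vector that carries information from all three sources, which a downstream model could then treat as one description of the concept.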

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

