Latest from MIT Tech Review – Multimodal: AI’s new frontier

Multimodality is a relatively new term for something extremely old: how people have learned about the world since humanity appeared. Individuals receive information from myriad sources via their senses, including sight, sound, and touch. Human brains combine these different modes of data into a highly nuanced, holistic picture of reality.

“Communication between humans is multimodal,” says Jina AI CEO Han Xiao. “They use text, voice, emotions, expressions, and sometimes photos.” Those are just a few obvious means of sharing information. Given this, he adds, “it is very safe to assume that future communication between human and machine will also be multimodal.”

A technology that sees the world from different angles

We are not there yet. The furthest advances in this direction have occurred in the fledgling field of multimodal AI. The problem is not a lack of vision. While a technology able to translate between modalities would clearly be valuable, Mirella Lapata, a professor at the University of Edinburgh and director of its Laboratory for Integrated Artificial Intelligence, says “it’s a lot more complicated” to execute than unimodal AI.

In practice, generative AI tools use different strategies for different types of data when building large data models, the complex neural networks that organize vast amounts of information. For example, those that draw on textual sources split the text into individual tokens, usually words or pieces of words. Each token is assigned an “embedding” or “vector”: a list of numbers representing how and where the token is used compared to others. Together, these numbers form a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
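To make the idea of tokens and embeddings concrete, here is a minimal Python sketch. It is an illustration only: the three-number vectors are invented by hand, the vocabulary and the function names (`tokenize`, `embed`, `cosine_similarity`) are hypothetical, and real models learn vectors with hundreds or thousands of dimensions from data.

```python
import math

# Hypothetical toy vocabulary: each token maps to a hand-picked vector.
# These values are made up for illustration; real models learn them from data.
EMBEDDINGS = {
    "tree": [0.9, 0.1, 0.0],
    "oak": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def tokenize(text):
    """Split text into lowercase word tokens (real tokenizers often use word pieces)."""
    return text.lower().split()


def embed(tokens):
    """Look up the vector for each known token, skipping unknown ones."""
    return [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (closer to 1 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tokens = tokenize("Oak tree")
oak_vec, tree_vec = embed(tokens)
tree_vs_oak = cosine_similarity(tree_vec, oak_vec)
tree_vs_car = cosine_similarity(tree_vec, EMBEDDINGS["car"])
```

In this toy setup, “tree” and “oak” point in nearly the same direction, so their similarity score is higher than the score for “tree” and “car”. That is the sense in which the numbers stand in for meaning.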

A multimodal AI model typically relies on several unimodal ones. As Henry Ajder, founder of AI consultancy Latent Space, puts it, this involves “almost stringing together” the various contributing models. Doing so involves various techniques to align the elements of each unimodal model, in a process called fusion. For example, the word “tree”, an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality.
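The article does not say which fusion techniques are used, so the following Python sketch shows just one simple possibility under stated assumptions: each unimodal model has already produced a vector, and the vectors are scaled to the same length and joined end to end. The function names (`l2_normalize`, `fuse_by_concatenation`) and all numeric values are hypothetical stand-ins, not output from any real model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so no single modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse_by_concatenation(*modality_vectors):
    """Join the normalized vectors from each modality into one combined vector."""
    fused = []
    for vec in modality_vectors:
        fused.extend(l2_normalize(vec))
    return fused


# Made-up stand-ins for the outputs of three separate unimodal models.
text_vec = [0.9, 0.1]         # the word "tree"
image_vec = [3.0, 4.0, 0.0]   # a photo of an oak tree
audio_vec = [0.0, 2.0]        # a recording of rustling leaves

fused = fuse_by_concatenation(text_vec, image_vec, audio_vec)
```

The fused vector has seven entries, two from text, three from the image, and two from audio, so a downstream model can draw on all three views of the same tree at once. Production systems generally use learned alignment rather than plain concatenation.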

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.
