Latest from MIT Tech Review – Multimodal: AI’s new frontier

Multimodality is a relatively new term for something extremely old: how people have learned about the world since humanity appeared. Individuals receive information from myriad sources via their senses, including sight, sound, and touch. Human brains combine these different modes of data into a highly nuanced, holistic picture of reality.

“Communication between humans is multimodal,” says Jina AI CEO Han Xiao. “They use text, voice, emotions, expressions, and sometimes photos.” Those are just a few of the obvious means of sharing information. Given this, he adds, “it is very safe to assume that future communication between human and machine will also be multimodal.”

A technology that sees the world from different angles

We are not there yet. The furthest advances in this direction have occurred in the fledgling field of multimodal AI. The problem is not a lack of vision. While a technology able to translate between modalities would clearly be valuable, Mirella Lapata, a professor at the University of Edinburgh and director of its Laboratory for Integrated Artificial Intelligence, says “it’s a lot more complicated” to execute than unimodal AI.

In practice, generative AI tools use different strategies for different types of data when building large data models, the complex neural networks that organize vast amounts of information. For example, models that draw on textual sources split text into individual tokens, usually words. Each token is assigned an “embedding” or “vector”: a list of numbers representing how and where the token is used compared to others. Taken together, the numbers in the vector create a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
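To make the idea of token embeddings concrete, here is a minimal sketch in Python. It assumes a toy whitespace tokenizer and a hand-written table of three-dimensional vectors; the names `EMBEDDINGS`, `tokenize`, `embed`, and `cosine_similarity` and all the numbers are invented for illustration. Real models learn vectors with hundreds or thousands of dimensions from data.

```python
import math

# Hypothetical, hand-picked embeddings. Real models learn these values
# from data and use far more than three dimensions.
EMBEDDINGS = {
    "tree": [0.9, 0.1, 0.0],
    "oak": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def tokenize(text):
    """Split text into lowercase word tokens (a deliberate simplification)."""
    return text.lower().split()


def embed(text):
    """Return the vector for every token that appears in the toy table."""
    return [EMBEDDINGS[token] for token in tokenize(text) if token in EMBEDDINGS]


def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tree, oak, car = EMBEDDINGS["tree"], EMBEDDINGS["oak"], EMBEDDINGS["car"]
print(cosine_similarity(tree, oak) > cosine_similarity(tree, car))
```

In this toy table, “tree” sits closer to “oak” than to “car”, which is the sense in which a vector can stand in for a token’s meaning.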

A multimodal AI model typically relies on several unimodal ones. As Henry Ajder, founder of AI consultancy Latent Space, puts it, this involves “almost stringing together” the various contributing models. Doing so involves various techniques to align the elements of each unimodal model, in a process called fusion. For example, the word “tree”, an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality.
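The “tree” example can be sketched in a few lines of Python. The article does not say which fusion technique any particular system uses, so this sketch assumes one simple option: project each unimodal vector into a shared space, then average the results. The vectors, the projection matrices, and the helper names `project` and `fuse` are all hypothetical; in a real system the projections would be learned during training.

```python
def project(vector, matrix):
    """Map a vector into the shared space: one output value per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(vectors):
    """Fuse aligned vectors by element-wise averaging (one simple strategy)."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


# Made-up outputs of three separate unimodal encoders, each with its own size.
text_vec = [1.0, 0.0]         # the word "tree"
image_vec = [0.5, 0.5, 0.0]   # an image of an oak tree
audio_vec = [2.0]             # audio of rustling leaves

# Hypothetical projection matrices into a shared two-dimensional space.
TEXT_PROJ = [[1.0, 0.0], [0.0, 1.0]]
IMAGE_PROJ = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
AUDIO_PROJ = [[0.5], [0.0]]

aligned = [
    project(text_vec, TEXT_PROJ),
    project(image_vec, IMAGE_PROJ),
    project(audio_vec, AUDIO_PROJ),
]
fused = fuse(aligned)
print(fused)
```

The numbers are chosen so that all three inputs land on the same point in the shared space, which is what alignment aims for when different modalities describe the same thing.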

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

Artificial Intelligence

Latest from MIT Tech Review – A Cambridge Analytica-style scandal for AI is coming

Can you imagine a car company putting a new vehicle on the market without built-in safety features? Unlikely, isn’t it? But what AI companies are doing is a bit like releasing race cars without seatbelts or fully working brakes, and figuring things out as they go. This approach is now getting them in trouble. For…

Artificial Intelligence

Latest from MIT: The promise and pitfalls of artificial intelligence explored at TEDxMIT event

Scientists, students, and community members came together last month to discuss the promise and pitfalls of artificial intelligence at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) for the fourth TEDxMIT event held at MIT. Attendees were entertained and challenged as they explored “the good and bad of computing,” explained CSAIL Director Professor Daniela Rus,…

Artificial Intelligence

Latest from MIT: Four Lincoln Laboratory technologies win five 2023 R&D 100 awards

Ultrasound that doesn’t require touching patients. A web-based tool that reinvents crew scheduling for the Air Force. Cryptographic hardware that protects sensitive data. And the world’s first practical memory for quantum networking. These four technologies developed at MIT Lincoln Laboratory, either wholly or with collaborators, received 2023 R&D 100 Awards. The ultrasound technology also received…

Artificial Intelligence

Latest from Google AI – Distilling step-by-step: Outperforming larger language models with less training data and smaller model sizes

Posted by Cheng-Yu Hsieh, Student Researcher, and Chen-Yu Lee, Research Scientist, Cloud AI Team

Large language models (LLMs) have enabled a new data-efficient learning paradigm wherein they can be used to solve unseen new tasks via zero-shot or few-shot prompting. However, LLMs are challenging to deploy for real-world applications due to their sheer size. For…

Artificial Intelligence

Latest from Google AI – Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations

Posted by Juhyun Lee and Raman Sarokin, Software Engineers, Core Systems & Experiences

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of…

Artificial Intelligence

Latest from Google AI – Natural Language Assessment: A New Framework to Promote Education

Posted by Kedem Snir, Software Engineer, and Gal Elidan, Senior Staff Research Scientist, Google Research

Whether it’s a professional honing their skills or a child learning to read, coaches and educators play a key role in assessing the learner’s answer to a question in a given context and guiding them towards a goal. These interactions…
