Google DeepMind today launched the next generation of its powerful artificial intelligence model Gemini, which has an enhanced ability to work with large amounts of video, text, and images.

It’s an advance on the three versions of Gemini 1.0 that Google announced back in December, which range in size and complexity from Nano to Pro to Ultra. (Google rolled out Gemini 1.0 Pro and 1.0 Ultra across many of its products last week.) Google is now releasing a preview of Gemini 1.5 Pro to select developers and business customers. The company says that the mid-tier Gemini 1.5 Pro matches its previous top-tier model, Gemini 1.0 Ultra, in performance but uses less computing power (yes, the names are confusing!).

Crucially, the 1.5 Pro model can handle much larger amounts of data from users, including longer prompts. Every AI model has a ceiling on how much data it can digest at once. The standard version of the new Gemini 1.5 Pro can handle inputs as large as 128,000 tokens, the words or parts of words into which an AI model breaks its inputs. That’s on a par with the best version of GPT-4 (GPT-4 Turbo).

However, a limited group of developers will be able to submit up to 1 million tokens to Gemini 1.5 Pro, which equates to roughly 1 hour of video, 11 hours of audio, or 700,000 words of text. That’s a significant jump that makes it possible to do things that no other models are currently capable of.
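
To give a sense of scale, the figures quoted above imply roughly 0.7 words per token. The short Python sketch below uses only that ratio to convert between the two units. The function names are illustrative, and the ratio is an approximation taken from the numbers in this article; real tokenizers vary with language and content.

```python
# Rough conversion based on the article's figures:
# about 1,000,000 tokens for about 700,000 words of text.
# This ratio is an approximation, not a property of any specific tokenizer.
WORDS_PER_MILLION_TOKENS = 700_000
MILLION = 1_000_000


def words_for_tokens(tokens: int) -> int:
    """Estimate how many words fit in a given token budget (rounded down)."""
    return tokens * WORDS_PER_MILLION_TOKENS // MILLION


def tokens_for_words(words: int) -> int:
    """Estimate how many tokens a text of `words` words needs (rounded up)."""
    return (words * MILLION + WORDS_PER_MILLION_TOKENS - 1) // WORDS_PER_MILLION_TOKENS


if __name__ == "__main__":
    print(words_for_tokens(128_000))    # standard context window, in words
    print(words_for_tokens(1_000_000))  # extended context window, in words
    print(tokens_for_words(700_000))    # tokens needed for 700,000 words
```

By this estimate, the standard 128,000-token window holds on the order of 90,000 words, against about 700,000 for the million-token version.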

In one demonstration video shown by Google, using the million-token version, researchers fed the model a 402-page transcript of the Apollo moon landing mission. Then they showed Gemini a hand-drawn sketch of a boot, and asked it to identify the moment in the transcript that the drawing represents.

“This is the moment Neil Armstrong landed on the moon,” the chatbot responded correctly. “He said, ‘One small step for man, one giant leap for mankind.’”

The model was also able to identify moments of humor. When asked by the researchers to find a funny moment in the Apollo transcript, it picked out when astronaut Mike Collins referred to Armstrong as “the Czar.” (Probably not the best line, but you get the point.)

In another demonstration, the team uploaded a 44-minute silent film featuring Buster Keaton and asked the AI to identify what information was on a piece of paper that, at some point in the movie, is removed from a character’s pocket. In less than a minute, the model found the scene and correctly recalled the text written on the paper. Researchers also repeated a task from the Apollo experiment, asking the model to find a scene in the film based on a drawing, which it did.

Google says it put Gemini 1.5 Pro through the usual battery of tests it uses when developing large language models, including evaluations that combine text, code, images, audio and video. It found that 1.5 Pro outperformed 1.0 Pro on 87% of the benchmarks and more or less matched 1.0 Ultra across all of them while using less computing power. 

The ability to handle larger inputs, Google says, is a result of progress in what’s called mixture-of-experts architecture. An AI using this design divides its neural network into chunks, only activating the parts that are relevant to the task at hand, rather than firing up the whole network at once. (Google is not alone in using this architecture; French AI firm Mistral released a model using it, and GPT-4 is rumored to employ the tech as well.)
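
The routing idea can be illustrated with a toy example. The Python sketch below is a minimal, generic mixture-of-experts layer, not Google's implementation: the experts are stand-in functions, and the gate vectors, the `moe_forward` name, and the `top_k` parameter are all assumptions made for illustration. A small "gate" scores every expert for a given input, only the top-scoring few are run, and their outputs are blended.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical expert: a tiny function standing in for a sub-network."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, gate_vectors, experts, top_k=2):
    """Run only the top_k experts chosen by the gate and blend their outputs."""
    # Gate score per expert: dot product of the input with that expert's gate vector.
    scores = [sum(g * v for g, v in zip(gate, x)) for gate in gate_vectors]
    # Select the indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Normalize the scores of the chosen experts into blending weights.
    weights = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for weight, index in zip(weights, chosen):
        expert_out = experts[index](x)  # only the selected experts are evaluated
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, chosen


if __name__ == "__main__":
    experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
    gate_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    output, chosen = moe_forward([1.0, 0.0], gate_vectors, experts, top_k=2)
    print("experts used:", chosen)
    print("output:", output)
```

In this toy run, two of the four experts are evaluated for the input and the other two are skipped entirely, which is where the savings in computation come from.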

“In one way it operates much like our brain does, where not the whole brain activates all the time,” says Oriol Vinyals, a deep learning team lead at DeepMind. This compartmentalizing saves computing power and can let the AI generate responses faster.

“That kind of fluidity going back and forth across different modalities, and using that to search and understand, is very impressive,” says Oren Etzioni, former technical director of the Allen Institute for Artificial Intelligence, who was not involved in the work. “This is stuff I have not seen before.”

An AI that can operate across modalities would more closely resemble the way that human beings behave. “People are naturally multimodal,” Etzioni says, because we can effortlessly switch between speaking, writing, and drawing images or charts to convey ideas. 

Etzioni cautioned against taking too much meaning from the developments, however. “There’s a famous line,” he says. “Never trust an AI demo.” 

For one, it’s not clear how much the demonstration videos left out or cherry-picked from various tasks (Google was indeed criticized after its earlier Gemini launch for not disclosing that a demo video had been sped up). It’s also possible the model would not be able to replicate some of the demonstrations if the input wording were slightly tweaked. AI models in general, says Etzioni, are brittle.

Today’s release of Gemini 1.5 Pro is limited to developers and enterprise customers. Google did not specify when it will be available for wider release. 
