Latest from MIT Tech Review – What’s next for generative video

When OpenAI revealed its new generative video model, Sora, last month, it invited a handful of filmmakers to try it out. This week the company published the results: seven surreal short films that leave no doubt that the future of generative video is coming fast.

The first batch of models that could turn text into video appeared in late 2022, from companies including Meta, Google, and video-tech startup Runway. It was a neat trick, but the results were grainy, glitchy, and just a few seconds long.

Fast-forward 18 months, and the best of Sora’s high-definition, photorealistic output is so stunning that some breathless observers are predicting the death of Hollywood. Runway’s latest models can produce short clips that rival those made by blockbuster animation studios. Midjourney and Stability AI, the firms behind two of the most popular text-to-image models, are now working on video as well.

A number of companies are racing to make a business on the back of these breakthroughs. Most are figuring out what that business is as they go. “I’ll routinely scream, ‘Holy cow, that is wicked cool,’ while playing with these tools,” says Gary Lipkowitz, CEO of Vyond, a firm that provides a point-and-click platform for putting together short animated videos. “But how can you use this at work?”

Whatever the answer to that question, it will probably upend a wide range of businesses and change the roles of many professionals, from animators to advertisers. Fears of misuse are also growing. The widespread ability to generate fake video will make it easier than ever to flood the internet with propaganda and nonconsensual porn. We can see it coming. The problem is, nobody has a good fix.

As we continue to get to grips with what’s ahead, good and bad, here are four things to think about. We’ve also curated a selection of the best videos filmmakers have made using this technology, including an exclusive reveal of Somme Requiem, an experimental short film by Los Angeles-based production company Myles. Read on for a taste of where AI moviemaking is headed.

1. Sora is just the start

OpenAI’s Sora is currently head and shoulders above the competition in video generation. But other companies are working hard to catch up. The market is going to get extremely crowded over the next few months as more firms refine their technology and start rolling out Sora’s rivals.

The UK-based startup Haiper came out of stealth this month. It was founded in 2021 by former Google DeepMind and TikTok researchers who wanted to work on technology called neural radiance fields, or NeRF, which can transform 2D images into 3D virtual environments. They thought a tool that turned snapshots into scenes users could step into would be useful for making video games.

But six months ago, Haiper pivoted from virtual environments to video clips, adapting its technology to fit what CEO Yishu Miao believes will be an even bigger market than games. “We realized that video generation was the sweet spot,” says Miao. “There will be a super-high demand for it.”

Air Head is a short film made by Shy Kids, a pop band and filmmaking collective based in Toronto, using Sora.

Like OpenAI’s Sora, Haiper’s generative video tech uses a diffusion model to manage the visuals and a transformer (the component in large language models like GPT-4 that makes them so good at predicting what comes next) to manage the consistency between frames. “Videos are sequences of data, and transformers are the best model to learn sequences,” says Miao.

Consistency is a big challenge for generative video and the main reason existing tools produce just a few seconds of video at a time. Transformers for video generation can boost the quality and length of the clips. The downside is that transformers make stuff up, or hallucinate. In text, this is not always obvious. In video, it can result in, say, a person with multiple heads. Keeping transformers on track requires vast silos of training data and warehouses full of computers.

That’s why Irreverent Labs, founded by former Microsoft researchers, is taking a different approach. Like Haiper, Irreverent Labs started out generating environments for games before switching to full video generation. But the company doesn’t want to follow the herd by copying what OpenAI and others are doing. “Because then it’s a battle of compute, a total GPU war,” says David Raskino, Irreverent’s co-founder and CTO. “And there’s only one winner in that scenario, and he wears a leather jacket.” (He’s talking about Jensen Huang, CEO of the trillion-dollar chip giant Nvidia.)

Instead of using a transformer, Irreverent’s tech combines a diffusion model with a model that predicts what’s in the next frame based on common-sense physics, such as how a ball bounces or how water splashes on the floor. Raskino says this approach reduces both training costs and the number of hallucinations. The model still produces glitches, but they are distortions of physics (like a bouncing ball not following a smooth curve, for example) with known mathematical fixes that can be applied to the video after it is generated, he says.
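To make the idea of a "known mathematical fix" concrete, here is a minimal sketch of one such correction: fitting a smooth parabola to the tracked height of a ball in free flight and snapping glitchy frames back onto it. This is an illustration only, not Irreverent Labs' method. The function names, the assumption that the ball's height per frame has already been extracted, and the choice of a least-squares fit are all hypothetical.

```python
# Hypothetical illustration of a post-hoc physics correction.
# Not Irreverent Labs' actual method; all names here are made up.

def fit_parabola(ts, ys):
    """Least-squares fit of y = a*t^2 + b*t + c. Returns (a, b, c)."""
    # Power sums for the normal equations: s[k] = sum of t^k.
    s = [sum(t ** k for t in ts) for k in range(5)]
    # Right-hand side: r[k] = sum of y * t^k.
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], r[2]],
        [s[3], s[2], s[1], r[1]],
        [s[2], s[1], s[0], r[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                f = m[row][col] / m[col][col]
                m[row] = [x - f * p for x, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(n))


def smooth_trajectory(ts, ys):
    """Replace each observed height with the fitted parabola's value."""
    a, b, c = fit_parabola(ts, ys)
    return [a * t * t + b * t + c for t in ts]


# Usage: eleven frames of a ball in free flight, with one glitched frame.
times = [i / 10 for i in range(11)]
true_heights = [-4.9 * t * t + 5.0 * t + 1.0 for t in times]
glitchy = list(true_heights)
glitchy[5] += 0.5  # simulated generation glitch in the middle frame
corrected = smooth_trajectory(times, glitchy)
```

A real pipeline would first have to detect and track the object, and a ball that actually bounces would need one fit per arc between bounces. The sketch only shows why a distortion of simple physics is easier to repair than an arbitrary hallucination: the correct shape of the curve is known in advance.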

Which approach will last remains to be seen. Miao compares today’s technology to large language models circa GPT-2. Five years ago, OpenAI’s groundbreaking early model amazed people because it showed what was possible. But it took several more years for the technology to become a game-changer.

It’s the same with video, says Miao: “We’re all at the bottom of the mountain.”

2. What will people do with generative video?

Video is the medium of the internet. YouTube, TikTok, newsreels, ads: expect to see synthetic video popping up everywhere there’s video already.

The marketing industry is one of the most enthusiastic adopters of generative technology. Two thirds of marketing professionals have experimented with generative AI in their jobs, according to a recent survey Adobe carried out in the US, with more than half saying they have used the technology to produce images.

Generative video is next. A few marketing firms have already put out short films to demonstrate the technology’s potential. The latest example is the 2.5-minute-long Somme Requiem, made by Myles. You can watch the film below in an exclusive reveal from MIT Technology Review.

Somme Requiem is a short film made by Los Angeles production company Myles. Every shot was generated using Runway’s Gen 2 model. The clips were then edited together by a team of video editors at Myles.

Somme Requiem depicts snowbound soldiers during the World War I Christmas ceasefire in 1914. The film is made up of dozens of different shots that were produced using a generative video model from Runway, then stitched together, color-corrected, and set to music by human video editors at Myles. “The future of storytelling will be a hybrid workflow,” says founder and CEO Josh Kahn.

Kahn picked the period wartime setting to make a point. He notes that the Apple TV+ series Masters of the Air, which follows a group of World War II airmen, cost $250 million. The team behind Peter Jackson’s World War I documentary They Shall Not Grow Old spent four years curating and restoring more than 100 hours of archival film. “Most filmmakers can only dream of ever having an opportunity to tell a story in this genre,” says Kahn.

“Independent filmmaking has been kind of dying,” he adds. “I think this will create an incredible resurgence.”

Raskino hopes so. “The horror movie genre is where people test new things, to try new things until they break,” he says. “I think we’re going to see a blockbuster horror movie created by, like, four people in a basement somewhere using AI.”

So is generative video a Hollywood-killer? Not yet. Somme Requiem’s scene-setting shots—empty woods, a desolate military camp—look great. But the people in it are still afflicted with mangled fingers and distorted faces, hallmarks of the technology. Generative video is best at wide-angle pans or lingering close-ups, which creates an eerie atmosphere but little action. If Somme Requiem were any longer it would get dull.

But scene-setting shots pop up all the time in feature-length movies. Most are just a few seconds long, but they can take hours to film. Raskino suggests that generative video models could soon be used to produce those in-between shots for a fraction of the cost. This could also be done on the fly in later stages of production, without requiring a reshoot.

Michal Pechoucek, CTO at Gen Digital, the cybersecurity giant behind a range of antivirus brands including Norton and Avast, agrees. “I think this is where the technology is headed,” he says. “We’ll see many different models, each specifically trained in a certain domain of movie production. These will just be tools used by talented video production teams.”

We’re not there quite yet. A big problem with generative video is the lack of control users have over the output. Producing still images can be hit and miss; producing a few seconds of video is even more risky.

“Right now it’s still fun, you get a-ha moments,” says Miao. “But generating video that is exactly what you want is a very hard technical problem. We are some way off generating long, consistent videos from a single prompt.”

That’s why Vyond’s Lipkowitz thinks the technology isn’t yet ready for most corporate clients. These users want a lot more control over the look of a video than current tools give them, he says.

Thousands of companies around the world, including around 65% of the Fortune 500 firms, use Vyond’s platform to create animated videos for in-house communications, training, marketing, and more. Vyond draws on a range of generative models, including text-to-image and text-to-voice, but provides a simple drag-and-drop interface that lets users put together a video by hand, piece by piece, rather than generate a full clip with a click.

Running a generative model is like rolling dice, says Lipkowitz. “This is a hard no for most video production teams, particularly in the enterprise sector where everything must be pixel-perfect and on brand,” he says. “If the video turns out bad—maybe the characters have too many fingers, or maybe there is a company logo that is the wrong color—well, unlucky, that’s just how gen AI works.”

The solution? More data, more training, repeat. “I wish I could point to some sophisticated algorithms,” says Miao. “But no, it’s just a lot more learning.”

3. Misinformation isn’t new, but deepfakes will make it worse

Online misinformation has been undermining our faith in the media, in institutions, and in each other for years. Some fear that adding fake video to the mix will destroy whatever pillars of shared reality we have left.

“We are replacing trust with mistrust, confusion, fear, and hate,” says Pechoucek. “Society without ground truth will degenerate.”

Pechoucek is especially worried about the malicious use of deepfakes in elections. During last year’s elections in Slovakia, for example, attackers shared a fake video that showed the leading candidate discussing plans to manipulate voters. The video was low quality and easy to spot as a deepfake. But Pechoucek believes it was enough to turn the result in favor of the other candidate.

Adventurous Puppies is a short clip made by OpenAI using Sora.

John Wissinger, who leads the strategy and innovation teams at Blackbird AI, a firm that tracks and manages the spread of misinformation online, believes fake video will be most persuasive when it blends real and fake footage. Take two videos showing President Joe Biden walking across a stage. In one he stumbles, in the other he doesn’t. Who is to say which is real?

“Let’s say an event actually occurred, but the way it’s presented to me is subtly different,” says Wissinger. “That can affect my emotional response to it.” As Pechoucek noted, a fake video doesn’t even need to be that good to make an impact. A bad fake that fits existing biases will do more damage than a slick fake that doesn’t, says Wissinger.

That’s why Blackbird focuses on who is sharing what with whom. In some sense, whether something is true or false is less important than where it came from and how it is being spread, says Wissinger. His company already tracks low-tech misinformation, such as social media posts showing real images out of context. Generative technologies make things worse, but the problem of people presenting media in misleading ways, deliberately or otherwise, is not new, he says.

Throw bots into the mix, sharing and promoting misinformation on social networks, and things get messy. Just knowing that fake media is out there will sow seeds of doubt into bad-faith discourse. “You can see how pretty soon it could become impossible to discern between what’s synthesized and what’s real anymore,” says Wissinger.

4. We are facing a new online reality

Fakes will soon be everywhere, from disinformation campaigns, to ad spots, to Hollywood blockbusters. So what can we do to figure out what’s real and what’s just fantasy? There are a range of solutions, but none will work by themselves.

The tech industry is working on the problem. Most generative tools try to enforce certain terms of use, such as preventing people from creating videos of public figures. But there are ways to bypass these filters, and open-source versions of the tools may come with more permissive policies.

Companies are also developing standards for watermarking AI-generated media and tools for detecting it. But not all tools will add watermarks, and watermarks can be stripped from a video’s metadata. No reliable detection tool exists either. Even if such tools worked, they would become part of a cat-and-mouse game of trying to keep up with advances in the models they are designed to police.

Spaghetti Eating Will Smith is a short film made by OpenAI using Sora.

Online platforms like X and Facebook have poor track records when it comes to moderation. We should not expect them to do better once the problem gets harder. Miao used to work at TikTok, where he helped build a moderation tool that detects video uploads that violate TikTok’s terms of use. Even he is wary of what’s coming: “There’s real danger out there,” he says. “Don’t trust things that you see on your laptop.”

Blackbird has developed a tool called Compass, which lets you fact check articles and social media posts. Paste a link into the tool and a large language model generates a blurb drawn from trusted online sources (these are always open to review, says Wissinger) that gives some context for the linked material. The result is very similar to the community notes that sometimes get attached to controversial posts on sites like X, Facebook, and Instagram. The company envisions having Compass generate community notes for anything. “We’re working on it,” says Wissinger.

But people who put links into a fact-checking website are already pretty savvy—and many others may not know such tools exist, or may not be inclined to trust them. Misinformation also tends to travel far wider than any subsequent correction.

In the meantime, people disagree on whose problem this is in the first place. Pechoucek says tech companies need to open up their software to allow for more competition around safety and trust. That would also let cybersecurity firms like his develop third-party software to police this tech. It’s what happened 30 years ago when Windows had a malware problem, he says: “Microsoft let antivirus firms in to help protect Windows. As a result, the online world became a safer place.”

But Pechoucek isn’t too optimistic. “Technology developers need to build their tools with safety as the top objective,” he says. “But more people think about how to make the technology more powerful than worry about how to make it more safe.”

Made by OpenAI using Sora.

There’s a common fatalistic refrain in the tech industry: change is coming, deal with it. “Generative AI is not going to get uninvented,” says Raskino. “This may not be very popular, but I think it’s true: I don’t think tech companies can bear the full burden. At the end of the day, the best defense against any technology is a very well-educated public. There’s no shortcut.”

Miao agrees. “It’s inevitable that we will massively adopt generative technology,” he says. “But it’s also the responsibility of the whole of society. We need to educate people.”

“Technology will move forward, and we need to be prepared for this change,” he adds. “We need to remind our parents, our friends, that the things they see on their screen might not be authentic.” This is especially true for older generations, he says: “Our parents need to be aware of this kind of danger. I think everyone should work together.”

We’ll need to work together quickly. When Sora came out a month ago, the tech world was stunned by how quickly generative video had progressed. But the vast majority of people have no idea this kind of technology even exists, says Wissinger: “They certainly don’t understand the trend lines that we’re on. I think it’s going to catch the world by storm.”
