Latest from Google AI – End-to-end Generative Pre-training for Multimodal Video Captioning
Posted by Paul Hongsuck Seo and Arsha Nagrani, Research Scientists, Google Research, Perception Team Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments…