An astounding number of videos are available on the Web, covering a variety of content from everyday moments people share to historical moments to scientific observations, each of which contains a unique record of the world. The right tools could help researchers analyze these videos, transforming how we understand the world around us.
Videos offer dynamic visual content far richer than static images, capturing movement, changes, and dynamic relationships between entities. Analyzing this complexity, along with the immense diversity of publicly available video data, demands models that go beyond traditional image understanding. Consequently, many of the approaches that perform best on video understanding still rely on specialized models tailor-made for particular tasks. Recently, there has been exciting progress in this area using video foundation models (ViFMs), such as VideoCLIP, InternVideo, VideoCoCa, and UMT. However, building a ViFM that handles the sheer diversity of video data remains a challenge.
With the goal of building a single model for general-purpose video understanding, we introduce “VideoPrism: A Foundational Visual Encoder for Video Understanding”. VideoPrism is a ViFM designed to handle a wide spectrum of video understanding tasks, including classification, localization, retrieval, captioning, and question answering (QA). We propose innovations in both the pre-training data and the modeling strategy. We pre-train VideoPrism on a massive and diverse dataset: 36 million high-quality video-text pairs and 582 million video clips with noisy or machine-generated parallel text. Our pre-training approach is designed for this hybrid data, learning both from video-text pairs and from the videos themselves. VideoPrism is easy to adapt to new video understanding challenges, and achieves state-of-the-art performance using a single frozen model.
Pre-training data
A powerful ViFM needs a very large collection of videos on which to train, similar to other foundation models (FMs) such as large language models (LLMs). Ideally, we would want the pre-training data to be a representative sample of all the videos in the world. While most of these videos naturally do not have perfect captions or descriptions, even imperfect text can provide useful information about the semantic content of the video.
To give our model the best possible starting point, we put together a massive pre-training corpus consisting of several public and private datasets, including YT-Temporal-180M, InternVid, VideoCC, WTS-70M, and others. This includes 36 million carefully selected videos with high-quality captions, along with an additional 582 million clips with varying levels of noisy text (like auto-generated transcripts). To our knowledge, this is the largest and most diverse video training corpus of its kind.
Statistics on the video-text pre-training data. The large variations in the CLIP similarity scores (the higher, the better) demonstrate the diverse caption quality of our pre-training data, which is a byproduct of the various ways used to harvest the text.
Two-stage training
The VideoPrism model architecture stems from the standard vision transformer (ViT) with a factorized design that sequentially encodes spatial and temporal information, following ViViT. Our training approach leverages both the high-quality video-text data and the video data with noisy text mentioned above. To start, we use contrastive learning (an approach that minimizes the distance between positive video-text pairs while maximizing the distance between negative video-text pairs) to teach our model to match videos with their own text descriptions, including imperfect ones. This builds a foundation for matching semantic language content to visual content.
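To make the contrastive step concrete, here is a minimal sketch of a symmetric video-text contrastive loss in plain Python. It is an illustration under stated assumptions, not VideoPrism's training code: the function names, the temperature value of 0.07, and the toy 2-dimensional embeddings are all hypothetical, and the cross-entropy form is the commonly used one rather than a detail confirmed by this post.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of videos and texts is the positive pair.

    The temperature is an assumed, illustrative value.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    # logits[i][j] = similarity between video i and text j
    logits = [[dot(v, t) / temperature for t in texts] for v in videos]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: each video should match the text at the same index.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(videos, [[0.9, 0.1], [0.1, 0.9]])
swapped_loss = contrastive_loss(videos, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the loss is lower when each caption lines up with its own video than when the captions are swapped, which is the behavior the training objective rewards.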
After video-text contrastive training, we leverage the collection of videos without text descriptions. Here, we build on the masked video modeling framework to predict masked patches in a video, with a few improvements. We train the model to predict both the video-level global embedding and the token-wise embeddings from the first-stage model, to effectively leverage the knowledge acquired in that stage. We then randomly shuffle the predicted tokens to prevent the model from learning shortcuts.
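The ingredients of this second stage (masking, predicting first-stage embeddings, and shuffling tokens) can be sketched as small helper functions. This is a rough illustration only: every name below is hypothetical, the cosine-distance loss and the mean-pooled global embedding are assumptions, and the post does not specify exactly where the shuffle sits in the real pipeline, so the sketch shows the operation itself rather than its placement.

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity; an assumed choice of distillation distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def sample_mask(num_tokens, mask_ratio, rng):
    """Pick which token positions are hidden from the second-stage model."""
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))

def shuffle_tokens(tokens, rng):
    """Return the tokens in random order plus the permutation that was applied."""
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order], order

def distillation_loss(student_tokens, teacher_tokens, student_global, teacher_global):
    """Token-wise term plus video-level global term, both against first-stage targets."""
    token_term = sum(
        cosine_distance(s, t) for s, t in zip(student_tokens, teacher_tokens)
    ) / len(teacher_tokens)
    global_term = cosine_distance(student_global, teacher_global)
    return token_term + global_term

rng = random.Random(0)
# Stand-in for first-stage (teacher) token embeddings: 8 tokens of dimension 4.
teacher_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
# Assumption: the global embedding is the mean of the token embeddings.
teacher_global = [sum(col) / len(teacher_tokens) for col in zip(*teacher_tokens)]

masked = sample_mask(num_tokens=8, mask_ratio=0.5, rng=rng)
shuffled, order = shuffle_tokens(teacher_tokens, rng)
# A student that reproduces the teacher exactly incurs (numerically) zero loss.
perfect = distillation_loss(teacher_tokens, teacher_tokens, teacher_global, teacher_global)
```

Returning the permutation alongside the shuffled tokens keeps the sketch honest about one design point: whatever consumes the shuffled sequence can no longer rely on token order alone, which is the kind of shortcut the shuffle is meant to remove.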
What is unique about VideoPrism’s setup is that we use two complementary pre-training signals: text descriptions and the visual content within a video. Text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics. This enables VideoPrism to excel in tasks that demand an understanding of both appearance and motion.
Results
We conduct an extensive evaluation of VideoPrism across four broad categories of video understanding tasks: video classification and localization, video-text retrieval, video captioning and question answering, and scientific video understanding. VideoPrism achieves state-of-the-art performance on 30 out of 33 video understanding benchmarks, all with minimal adaptation of a single, frozen model.
VideoPrism compared to the previous best-performing FMs.
Classification and localization
We evaluate VideoPrism on an existing large-scale video understanding benchmark (VideoGLUE) covering classification and localization tasks. We find that (1) VideoPrism outperforms all of the other state-of-the-art FMs, and (2) no other single model consistently came in second place. This tells us that VideoPrism has learned to effectively pack a variety of video signals into one encoder, from semantics at different granularities to appearance and motion cues, and that it works well across a variety of video sources.
Combining with LLMs
We further explore combining VideoPrism with LLMs to unlock its ability to handle various video-language tasks. In particular, when paired with a text encoder (following LiT) or a language decoder (such as PaLM-2), VideoPrism can be used for video-text retrieval, video captioning, and video QA tasks. We compare the combined models on a broad and challenging set of vision-language benchmarks. VideoPrism sets the new state of the art on most benchmarks. From the visual results, we find that VideoPrism is capable of understanding complex motions and appearances in videos (e.g., the model can recognize the different colors of spinning objects on the window in the visual examples). These results demonstrate that VideoPrism is strongly compatible with language models.
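As a rough illustration of how a video encoder paired with a text encoder can serve retrieval, the sketch below ranks videos by cosine similarity to a text query in a shared embedding space. The clip names, the 3-dimensional embeddings, and the `rank_videos` helper are all hypothetical stand-ins; a real system would obtain the embeddings from the trained encoders, and this post does not describe the scoring function in detail.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_emb, video_embs):
    """Return video ids sorted from best to worst match for the text query."""
    scores = {vid: cosine_similarity(text_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings standing in for the outputs of a frozen video encoder.
video_embs = {
    "cat_clip": [0.9, 0.1, 0.0],
    "surf_clip": [0.0, 0.2, 0.9],
    "cook_clip": [0.1, 0.9, 0.1],
}
# Hypothetical text-encoder output for a query about surfing.
query = [0.1, 0.1, 0.95]
ranking = rank_videos(query, video_embs)
```

Because the video embeddings do not depend on the query, they can be computed once and reused for every search, which fits the frozen-encoder setup described above.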
Scientific applications
Finally, we test VideoPrism on datasets used by scientists across domains, including fields such as ethology, behavioral neuroscience, and ecology. These datasets typically require domain expertise to annotate, so we leverage existing scientific datasets open-sourced by the community, including Fly vs. Fly, CalMS21, ChimpACT, and KABR. VideoPrism not only performs exceptionally well, but actually surpasses models designed specifically for these tasks. This suggests that tools like VideoPrism have the potential to transform how scientists analyze video data across different fields.
Conclusion
With VideoPrism, we introduce a powerful and versatile video encoder that sets a new standard for general-purpose video understanding. Our emphasis on both building a massive and varied pre-training dataset and developing innovative modeling techniques has been validated through our extensive evaluations. Not only does VideoPrism consistently outperform strong baselines, but its unique ability to generalize positions it well for tackling an array of real-world applications. Because of its potential broad use, we are committed to continuing further responsible research in this space, guided by our AI Principles. We hope VideoPrism paves the way for future breakthroughs at the intersection of AI and video analysis, helping to realize the potential of ViFMs across domains such as scientific discovery, education, and healthcare.
Acknowledgements
This blog post is made on behalf of all the VideoPrism authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. We sincerely thank David Hendon for their product management efforts, and Alex Siegman, Ramya Ganeshan, and Victor Gomes for their program and resource management efforts. We also thank Hassan Akbari, Sherry Ben, Yoni Ben-Meshulam, Chun-Te Chu, Sam Clearwater, Yin Cui, Ilya Figotin, Anja Hauth, Sergey Ioffe, Xuhui Jia, Yeqing Li, Lu Jiang, Zu Kim, Dan Kondratyuk, Bill Mark, Arsha Nagrani, Caroline Pantofaru, Sushant Prakash, Cordelia Schmid, Bryan Seybold, Mojtaba Seyedhosseini, Amanda Sadler, Rif A. Saurous, Rachel Stigler, Paul Voigtlaender, Pingmei Xu, Chaochao Yan, Xuan Yang, and Yukun Zhu for the discussions, support, and feedback that greatly contributed to this work. We are grateful to Jay Yagnik, Rahul Sukthankar, and Tomas Izo for their enthusiastic support for this project. Lastly, we thank Tom Small, Jennifer J. Sun, Hao Zhou, Nitesh B. Gundavarapu, Luke Friedman, and Mikhail Sirotenko for the tremendous help with making this blog post.