A pretrained visual language model for describing multi-event videos

Posted by Antoine Yang, Student Researcher, and Arsha Nagrani, Research Scientist, Google Research, Perception team

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task, as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which consist of describing an image or a short video with a single sentence.

Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations. For example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.

In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we’ve also released the code for Vid2Seq here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

For a general overview of how we achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
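To make the idea of time tokens concrete, here is a minimal sketch of how timestamps could be discretized and interleaved with captions in one sequence. It is an illustration only, not the released implementation: the token format `<time_k>`, the bin count `NUM_TIME_BINS`, and the helper names `time_to_token` and `events_to_sequence` are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed number of time bins; this post does not state the value used by Vid2Seq.
NUM_TIME_BINS = 100

Event = Tuple[float, float, str]  # (start seconds, end seconds, caption)


def time_to_token(t: float, duration: float, num_bins: int = NUM_TIME_BINS) -> str:
    """Map an absolute timestamp to a discrete time token (hypothetical format)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Express the timestamp relative to the video length and clamp to [0, 1].
    rel = min(max(t / duration, 0.0), 1.0)
    # Quantize; the last bin also absorbs t == duration.
    idx = min(int(rel * num_bins), num_bins - 1)
    return f"<time_{idx}>"


def events_to_sequence(events: List[Event], duration: float,
                       num_bins: int = NUM_TIME_BINS) -> str:
    """Serialize events as '<start> <end> caption ...', ordered by start time."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        start_tok = time_to_token(start, duration, num_bins)
        end_tok = time_to_token(end, duration, num_bins)
        parts.append(f"{start_tok} {end_tok} {caption}")
    return " ".join(parts)


if __name__ == "__main__":
    sled_video = [
        (0.0, 30.0, "the dogs are hitched to the sled"),
        (30.0, 120.0, "the dogs pull the sled"),
    ]
    print(events_to_sequence(sled_video, duration=120.0))
```

Because the timestamps are relative to the video length, the same small vocabulary of time tokens can cover videos of any duration, at the cost of coarser resolution on longer videos.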

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
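Once the decoder has produced its output sequence, the predicted events still need to be read back out of it. The sketch below shows one way this could be done, assuming a hypothetical `<time_k>` token format with 100 bins and a "start token, end token, caption words" layout; neither the format nor the function name `sequence_to_events` comes from the Vid2Seq code.

```python
import re
from typing import List, Tuple

# Hypothetical time-token format, assumed for this example only.
TIME_TOKEN = re.compile(r"<time_(\d+)>")


def sequence_to_events(sequence: str, duration: float,
                       num_bins: int = 100) -> List[Tuple[float, float, str]]:
    """Parse '<start> <end> caption ...' back into (start_s, end_s, caption)."""
    tokens = sequence.split()
    events = []
    i = 0
    while i + 1 < len(tokens):
        m_start = TIME_TOKEN.fullmatch(tokens[i])
        m_end = TIME_TOKEN.fullmatch(tokens[i + 1])
        if not (m_start and m_end):
            # Not the start of a well-formed event; skip one token and retry.
            i += 1
            continue
        # Collect caption words until the next time token or the end of input.
        j = i + 2
        words = []
        while j < len(tokens) and not TIME_TOKEN.fullmatch(tokens[j]):
            words.append(tokens[j])
            j += 1
        # Convert bin indices back to seconds (left edge of each bin).
        start = int(m_start.group(1)) * duration / num_bins
        end = int(m_end.group(1)) * duration / num_bins
        events.append((start, end, " ".join(words)))
        i = j
    return events


if __name__ == "__main__":
    decoded = ("<time_0> <time_25> the dogs are hitched to the sled "
               "<time_25> <time_99> the dogs pull the sled")
    for start, end, caption in sequence_to_events(decoded, duration=120.0):
        print(f"{start:6.1f}s - {end:6.1f}s  {caption}")
```

Mapping a bin index back to the left edge of its bin loses up to one bin width of precision, which is the price of representing time with a small discrete vocabulary.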

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
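The span masking used for the denoising objective can be sketched as follows. This is a simplified illustration in the style of sentinel-based span corruption, not the exact procedure from the paper: the masking ratio, the fixed span length, the `<mask_k>` sentinel format and the helper names `mask_spans` and `unmask` are assumptions chosen for the example.

```python
import random
from typing import List, Tuple


def mask_spans(tokens: List[str], mask_ratio: float = 0.25, span_len: int = 3,
               seed: int = 0) -> Tuple[List[str], List[str]]:
    """Replace random spans with sentinels; return (noisy input, denoising target).

    mask_ratio and span_len are illustrative defaults, not values from the post.
    """
    rng = random.Random(seed)
    # Probability of starting a span at a position, so that roughly
    # mask_ratio of the tokens end up masked.
    start_prob = mask_ratio / span_len
    corrupted: List[str] = []
    target: List[str] = []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < start_prob:
            marker = f"<mask_{sentinel}>"      # hypothetical sentinel token
            corrupted.append(marker)           # the encoder sees only the sentinel
            target.append(marker)              # the decoder must emit the hidden span
            target.extend(tokens[i:i + span_len])
            i += span_len
            sentinel += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target


def unmask(corrupted: List[str], target: List[str]) -> List[str]:
    """Rebuild the original sequence; used here only to sanity-check the masking."""
    spans = {}
    current = None
    for tok in target:
        if tok.startswith("<mask_"):
            current = tok
            spans[current] = []
        else:
            spans[current].append(tok)
    restored: List[str] = []
    for tok in corrupted:
        restored.extend(spans[tok] if tok in spans else [tok])
    return restored


if __name__ == "__main__":
    speech = "<time_0> <time_25> first we hitch the dogs <time_25> <time_99> then they all race away".split()
    noisy, target = mask_spans(speech, seed=1)
    print("noisy input:", " ".join(noisy))
    print("target     :", " ".join(target))
    assert unmask(noisy, target) == speech
```

Note that time tokens and text tokens are masked alike in this sketch, so the model would have to recover both what was said and when.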

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.
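For readers unfamiliar with teacher forcing, the following toy sketch shows how the maximum likelihood objective is computed: at every step the model is conditioned on the ground-truth prefix rather than on its own predictions. The `next_token_probs` callable stands in for a real decoder and, like the `<bos>` token and the function name `teacher_forcing_nll`, is a hypothetical placeholder rather than part of Vid2Seq.

```python
import math
from typing import Callable, Dict, List, Sequence

BOS = "<bos>"  # assumed start-of-sequence token

# A stand-in for a decoder: maps a prefix to a distribution over next tokens.
NextTokenModel = Callable[[Sequence[str]], Dict[str, float]]


def teacher_forcing_nll(target: Sequence[str], next_token_probs: NextTokenModel) -> float:
    """Average negative log-likelihood of `target` under teacher forcing."""
    if not target:
        raise ValueError("target must contain at least one token")
    total = 0.0
    for t, gold in enumerate(target):
        # The history is always the ground truth, never the model's own samples.
        prefix: List[str] = [BOS] + list(target[:t])
        probs = next_token_probs(prefix)
        # Small floor avoids log(0) when the model assigns no mass to the gold token.
        total -= math.log(max(probs.get(gold, 0.0), 1e-12))
    return total / len(target)


if __name__ == "__main__":
    vocab = ["<time_0>", "<time_25>", "dogs", "<eos>"]

    def uniform_model(prefix: Sequence[str]) -> Dict[str, float]:
        # A model that knows nothing: equal probability for every token.
        return {tok: 1.0 / len(vocab) for tok in vocab}

    gold = ["<time_0>", "<time_25>", "dogs", "<eos>"]
    loss = teacher_forcing_nll(gold, uniform_model)
    print(f"loss of a uniform model: {loss:.4f} (log of vocabulary size: {math.log(len(vocab)):.4f})")
```

Minimizing this quantity over a fine-tuning dataset is equivalent to maximizing the likelihood of the annotated event sequences.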

Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).

Conclusion

We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.

Acknowledgements

This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.
