Video-aided Unsupervised Grammar Induction
- URL: http://arxiv.org/abs/2104.04369v1
- Date: Fri, 9 Apr 2021 14:01:36 GMT
- Title: Video-aided Unsupervised Grammar Induction
- Authors: Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, Jiebo Luo
- Abstract summary: We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
Video provides even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases.
We propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.
- Score: 108.53765268059425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate video-aided grammar induction, which learns a constituency
parser from both unlabeled text and its corresponding video. Existing methods
of multi-modal grammar induction focus on learning syntactic grammars from
text-image pairs, with promising results showing that the information from
static images is useful in induction. However, videos provide even richer
information, including not only static objects but also actions and state
changes useful for inducing verb phrases. In this paper, we explore rich
features (e.g. action, object, scene, audio, face, OCR and speech) from videos,
taking the recent Compound PCFG model as the baseline. We further propose a
Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich
features from different modalities. Our proposed MMC-PCFG is trained end-to-end
and outperforms each individual modality and previous state-of-the-art systems
on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the
effectiveness of leveraging video information for unsupervised grammar
induction.
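The following is a minimal sketch, assuming a simple attention-based pooling, of how per-modality video features (e.g. action, object, scene, audio, face, OCR and speech vectors) could be projected into a shared space and aggregated into a single video representation; the module names, dimensions and pooling scheme are illustrative assumptions and do not reproduce the actual MMC-PCFG architecture, where the aggregated features interact with the Compound PCFG's unsupervised objective.
```python
# Minimal sketch (not the authors' implementation): aggregating per-modality
# video features into a single video-level representation. All names,
# dimensions, and the attention-based pooling are illustrative assumptions.
import torch
import torch.nn as nn


class MultiModalAggregator(nn.Module):
    """Projects each modality into a shared space and pools them with attention."""

    def __init__(self, modality_dims: dict, shared_dim: int = 512):
        super().__init__()
        # One linear projection per modality (feature extractors differ in size).
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in modality_dims.items()}
        )
        # Scalar attention scores decide how much each modality contributes.
        self.score = nn.Linear(shared_dim, 1)

    def forward(self, features: dict) -> torch.Tensor:
        # features[name]: (batch, dim_name) pooled video-level feature per modality
        projected = torch.stack(
            [self.proj[name](feat) for name, feat in features.items()], dim=1
        )  # (batch, num_modalities, shared_dim)
        weights = torch.softmax(self.score(torch.tanh(projected)), dim=1)
        return (weights * projected).sum(dim=1)  # (batch, shared_dim)


if __name__ == "__main__":
    # Hypothetical feature sizes for three of the modalities named in the abstract.
    dims = {"action": 1024, "object": 2048, "audio": 128}
    model = MultiModalAggregator(dims)
    batch = {name: torch.randn(4, d) for name, d in dims.items()}
    print(model(batch).shape)  # torch.Size([4, 512])
```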
Related papers
- Grammar Induction from Visual, Speech and Text [91.98797120799227]
This work introduces a novel visual-audio-text grammar induction task (VAT-GI).
Inspired by the fact that language grammar exists beyond text, we argue that text need not be the predominant modality in grammar induction.
We propose a visual-audio-text inside-outside autoencoder (VaTiora) framework, which leverages rich modal-specific and complementary features for effective grammar parsing.
arXiv Detail & Related papers (2024-10-01T02:24:18Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning for text-video retrieval using two novel techniques; a generic form of the underlying contrastive objective is sketched below.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
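As background for the retrieval entry above, here is a minimal sketch of a symmetric InfoNCE-style text-video contrastive objective of the kind such methods build on; it does not implement the triplet partial margin technique, and the function name and temperature value are illustrative assumptions.
```python
# Minimal sketch of a symmetric text-video contrastive (InfoNCE-style) loss.
# Not the paper's specific technique; names and temperature are assumptions.
import torch
import torch.nn.functional as F


def text_video_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """text_emb, video_emb: (batch, dim) L2-normalized embeddings of matched pairs."""
    logits = text_emb @ video_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matched pairs sit on the diagonal; score retrieval in both directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2


if __name__ == "__main__":
    t = F.normalize(torch.randn(8, 256), dim=-1)
    v = F.normalize(torch.randn(8, 256), dim=-1)
    print(text_video_contrastive_loss(t, v).item())
```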
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features (a generic form of such span-video scoring is sketched after this entry).
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
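To illustrate the video-span correlation idea mentioned above, here is a minimal hypothetical sketch that scores every text span against a pooled video feature using learned projections; the mean-pooled span representation and dot-product scorer are assumptions for illustration, not the model described in that paper.
```python
# Minimal sketch (illustrative assumption, not the paper's model): scoring the
# correlation between text spans and a video representation with learned
# projections rather than hand-designed features.
import torch
import torch.nn as nn


class SpanVideoScorer(nn.Module):
    def __init__(self, token_dim: int, video_dim: int, shared_dim: int = 256):
        super().__init__()
        self.span_proj = nn.Linear(token_dim, shared_dim)
        self.video_proj = nn.Linear(video_dim, shared_dim)

    def forward(self, tokens: torch.Tensor, video: torch.Tensor) -> dict:
        """tokens: (seq_len, token_dim), video: (video_dim,) -> score per span."""
        seq_len = tokens.size(0)
        v = self.video_proj(video)  # (shared_dim,)
        scores = {}
        for i in range(seq_len):
            for j in range(i, seq_len):
                # Represent span (i, j) by mean-pooling its token vectors.
                span = self.span_proj(tokens[i : j + 1].mean(dim=0))
                scores[(i, j)] = torch.dot(span, v)
        return scores


if __name__ == "__main__":
    scorer = SpanVideoScorer(token_dim=300, video_dim=512)
    token_vecs = torch.randn(5, 300)   # e.g., word embeddings of a caption
    video_vec = torch.randn(512)       # pooled video feature
    print(len(scorer(token_vecs, video_vec)))  # 15 spans for a 5-token sentence
```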
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-aware Cascade Contrastive Learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here (including all generated summaries) and is not responsible for any consequences of its use.