Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
- URL: http://arxiv.org/abs/2201.06207v1
- Date: Mon, 17 Jan 2022 04:23:08 GMT
- Title: Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
- Authors: Arjun R Akula, Song-Chun Zhu
- Abstract summary: We are exploring a novel discourse-based framework to evaluate the coherence of video paragraphs.
Central to our approach is the discourse representation of videos, which helps in modeling the coherence of paragraphs conditioned on the coherence of videos.
Our experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all the baseline methods.
- Score: 99.37090317971312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video paragraph captioning is the task of automatically generating a coherent
paragraph description of the actions in a video. Previous linguistic studies
have demonstrated that coherence of a natural language text is reflected by its
discourse structure and relations. However, existing video captioning methods
evaluate the coherence of generated paragraphs by comparing them merely against
human paragraph annotations and fail to reason about the underlying discourse
structure. At UCLA, we are currently exploring a novel discourse-based
framework to evaluate the coherence of video paragraphs. Central to our
approach is the discourse representation of videos, which helps in modeling
coherence of paragraphs conditioned on coherence of videos. We also introduce
DisNet, a novel dataset containing the proposed visual discourse annotations of
3000 videos and their paragraphs. Our experimental results show that the
proposed framework evaluates the coherence of video paragraphs significantly
better than all the baseline methods. We believe that many other multi-discipline
Artificial Intelligence problems such as Visual Dialog and Visual Storytelling
would also greatly benefit from the proposed visual discourse framework and the
DisNet dataset.
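The abstract only sketches how paragraph coherence is conditioned on the video's discourse structure. The snippet below is a minimal illustrative sketch of that idea, not the paper's method: it scores a paragraph by checking whether the discourse relation between each pair of adjacent sentences is compatible with the relation between the video segments they describe. The classes (VideoSegment, Sentence), the COMPATIBLE table, and the scoring rule are all hypothetical assumptions introduced here for illustration.

```python
# Hypothetical sketch of scoring paragraph coherence conditioned on a video's
# discourse structure. Names and relation labels are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VideoSegment:
    start: float           # segment start time (seconds)
    end: float              # segment end time (seconds)
    relation_to_next: str   # discourse relation to the following segment


@dataclass
class Sentence:
    text: str
    aligned_segment: int    # index of the video segment this sentence describes
    relation_to_next: str   # discourse relation to the following sentence


# Hypothetical compatibility table between video-level and text-level relations.
COMPATIBLE = {
    ("temporal", "temporal"), ("temporal", "elaboration"),
    ("causal", "causal"), ("causal", "temporal"),
}


def paragraph_coherence(segments: list[VideoSegment], sentences: list[Sentence]) -> float:
    """Fraction of adjacent sentence pairs whose discourse relation is
    compatible with the relation between their aligned video segments."""
    if len(sentences) < 2:
        return 1.0
    ok = 0
    for cur, nxt in zip(sentences, sentences[1:]):
        # Sentences that jump backwards in the video's segment order are
        # treated as incoherent pairs.
        if nxt.aligned_segment < cur.aligned_segment:
            continue
        video_rel = segments[cur.aligned_segment].relation_to_next
        if (video_rel, cur.relation_to_next) in COMPATIBLE:
            ok += 1
    return ok / (len(sentences) - 1)


# Example: two segments described in order by two sentences.
segs = [VideoSegment(0.0, 4.0, "temporal"), VideoSegment(4.0, 9.0, "")]
sents = [Sentence("A man picks up a ball.", 0, "temporal"),
         Sentence("He then throws it.", 1, "")]
print(paragraph_coherence(segs, sents))  # 1.0
```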
Related papers
- Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video [22.60291297308379]
This paper proposes a novel self-supervised framework for video summarization guided by Large Language Models (LLMs).
Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
arXiv Detail & Related papers (2024-05-14T18:07:04Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and ...
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose a novel approach of Compositional Counterfactual Contrastive Learning to develop contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.