Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
- URL: http://arxiv.org/abs/2201.06207v1
- Date: Mon, 17 Jan 2022 04:23:08 GMT
- Title: Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
- Authors: Arjun R Akula, Song-Chun Zhu
- Abstract summary: We are exploring a novel discourse-based framework to evaluate the coherence of video paragraphs.
Central to our approach is the discourse representation of videos, which helps in modeling the coherence of paragraphs conditioned on the coherence of videos.
Our experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all the baseline methods.
- Score: 99.37090317971312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video paragraph captioning is the task of automatically generating a coherent
paragraph description of the actions in a video. Previous linguistic studies
have demonstrated that coherence of a natural language text is reflected by its
discourse structure and relations. However, existing video captioning methods
evaluate the coherence of generated paragraphs by comparing them merely against
human paragraph annotations and fail to reason about the underlying discourse
structure. At UCLA, we are currently exploring a novel discourse-based
framework to evaluate the coherence of video paragraphs. Central to our
approach is the discourse representation of videos, which helps in modeling
coherence of paragraphs conditioned on coherence of videos. We also introduce
DisNet, a novel dataset containing the proposed visual discourse annotations of
3000 videos and their paragraphs. Our experimental results show that the
proposed framework evaluates coherence of video paragraphs significantly better
than all the baseline methods. We believe that many other multidisciplinary
Artificial Intelligence problems such as Visual Dialog and Visual Storytelling
would also greatly benefit from the proposed visual discourse framework and the
DisNet dataset.
Related papers
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) project the semantic relations between auxiliary captions and
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose a novel approach of Compositional Counterfactual Contrastive Learning to develop contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)