Discriminative Latent Semantic Graph for Video Captioning
- URL: http://arxiv.org/abs/2108.03662v2
- Date: Tue, 10 Aug 2021 13:55:55 GMT
- Title: Discriminative Latent Semantic Graph for Video Captioning
- Authors: Yang Bai, Junyan Wang, Yang Long, Bingzhang Hu, Yang Song, Maurice
Pagnucco, Yu Guan
- Abstract summary: Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify and address three key problems in a joint framework: enhanced object proposals, visual knowledge extraction, and sentence validation.
- Score: 24.15455227330031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning aims to automatically generate natural language sentences
that can describe the visual contents of a given video. Existing generative
models like encoder-decoder frameworks cannot explicitly explore the
object-level interactions and frame-level information from complex
spatio-temporal data to generate semantic-rich captions. Our main contribution
is to identify three key problems in a joint framework for future video
summarization tasks. 1) Enhanced Object Proposal: we propose a novel
Conditional Graph that can fuse spatio-temporal information into latent object
proposal. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to
dynamically extract visual words with higher semantic levels. 3) Sentence
Validation: A novel Discriminative Language Validator is proposed to verify
generated captions so that key semantic concepts can be effectively preserved.
Our experiments on two public datasets (MSVD and MSR-VTT) show significant
improvements over state-of-the-art approaches on all metrics, especially for
BLEU-4 and CIDEr. Our code is available at
https://github.com/baiyang4/D-LSG-Video-Caption.
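As a rough illustration of the second component, Latent Proposal Aggregation, the following is a minimal, hypothetical PyTorch sketch in which a small set of learnable queries attends over object-proposal features to distill them into higher-level "visual word" embeddings. The module name, attention design, and dimensions are assumptions for illustration, not the released implementation; the repository above has the authors' code.

```python
# Hypothetical sketch of latent proposal aggregation: learnable query
# vectors attend over object-proposal features to yield a small set of
# higher-level "visual word" embeddings. Illustrative only.
import torch
import torch.nn as nn

class LatentProposalAggregation(nn.Module):
    def __init__(self, feat_dim=512, num_visual_words=16):
        super().__init__()
        # One learnable query per latent visual word.
        self.queries = nn.Parameter(torch.randn(num_visual_words, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, proposals):
        # proposals: (batch, num_proposals, feat_dim) object-level features,
        # e.g. already fused with spatio-temporal context by a graph module.
        q = self.queries.unsqueeze(0).expand(proposals.size(0), -1, -1)
        visual_words, _ = self.attn(q, proposals, proposals)
        return visual_words  # (batch, num_visual_words, feat_dim)

lpa = LatentProposalAggregation()
print(lpa(torch.randn(2, 100, 512)).shape)  # torch.Size([2, 16, 512])
```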
Related papers
- Unifying Latent and Lexicon Representations for Effective Video-Text
Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
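For intuition, here is a hedged sketch of what a lexicon representation can look like: inputs are projected into a sparse vocabulary-sized vector so video and text can be matched term by term. This follows the general style of lexicon retrieval models; it is not taken from the UNIFY paper, and all names and dimensions are illustrative.

```python
# Sketch of a lexicon head: embed contextual features into a sparse,
# non-negative vocabulary-sized vector for term-level matching.
# Illustrative only; not the UNIFY implementation.
import torch
import torch.nn as nn

class LexiconHead(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=30522):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, token_feats):
        # token_feats: (batch, seq_len, feat_dim) frame or token features.
        logits = self.proj(token_feats)       # (batch, seq_len, vocab)
        pooled = logits.max(dim=1).values     # max-pool over the sequence
        return torch.log1p(torch.relu(pooled))  # sparse, non-negative

head = LexiconHead()
video_lex = head(torch.randn(2, 32, 512))   # e.g. 32 frame features
text_lex = head(torch.randn(2, 16, 512))    # e.g. 16 token features
print((video_lex * text_lex).sum(dim=-1))   # lexicon matching scores
```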
arXiv Detail & Related papers (2024-02-26T17:36:50Z) - Videoprompter: an ensemble of foundational models for zero-shot video
understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
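A minimal sketch of that zero-shot classification step, assuming CLIP-style encoders have already produced a pooled video embedding and one text embedding per class label (possibly enriched by the generative models the paper ensembles); the function below is illustrative, not Videoprompter's code.

```python
# Score a query video against text-based class-label embeddings by cosine
# similarity and return the best-matching class. Illustrative sketch.
import torch
import torch.nn.functional as F

def classify_video(video_feat: torch.Tensor, label_feats: torch.Tensor) -> int:
    # video_feat: (dim,) pooled visual embedding of the query video.
    # label_feats: (num_classes, dim) text embeddings of the class labels.
    sims = F.cosine_similarity(video_feat.unsqueeze(0), label_feats, dim=-1)
    return int(sims.argmax())

video_feat = F.normalize(torch.randn(512), dim=-1)
label_feats = F.normalize(torch.randn(10, 512), dim=-1)
print(classify_video(video_feat, label_feats))  # index of the best class
```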
arXiv Detail & Related papers (2023-10-23T19:45:46Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z) - GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose GL-RG, a Global-Local Representation Granularity framework for video captioning.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary for a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur optimal captioning behavior.
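As a rough, assumption-laden sketch of the global-local idea (not the GL-RG architecture): pool frame features over the whole video for a global view and over short windows for local views, then concatenate the two granularities.

```python
# Fuse global (whole-video) and local (windowed) pooled frame features.
# Window size and fusion by concatenation are illustrative assumptions.
import torch

def global_local_features(frames: torch.Tensor, window: int = 4) -> torch.Tensor:
    # frames: (num_frames, dim) per-frame features.
    global_feat = frames.mean(dim=0, keepdim=True)  # (1, dim)
    n = frames.size(0) // window * window
    local = frames[:n].view(-1, window, frames.size(1)).mean(dim=1)
    return torch.cat([local, global_feat.expand(local.size(0), -1)], dim=-1)

print(global_local_features(torch.randn(16, 512)).shape)  # (4, 1024)
```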
arXiv Detail & Related papers (2022-05-22T02:00:09Z) - O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable
Video Captioning [41.14313691818424]
We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning.
O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption.
Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
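The three-step pipeline above can be outlined schematically as follows. This is a control-flow sketch with hard-coded placeholder outputs; in O2NA each step is a learned module, so nothing below should be read as the paper's model.

```python
# Schematic outline of the three O2NA steps with placeholder outputs.
from typing import List

def identify_objects(video) -> List[tuple]:
    # Step 1: pick focused objects and predict their caption positions.
    return [("dog", 1), ("frisbee", 5)]  # placeholder prediction

def draft_caption(objects: List[tuple], length: int = 6) -> List[str]:
    # Step 2: place object words at their slots; mask the rest for
    # attribute/relation words.
    caption = ["[MASK]"] * length
    for word, pos in objects:
        caption[pos] = word
    return caption

def refine(draft: List[str], video) -> List[str]:
    # Step 3: fill masked slots non-autoregressively using video info.
    words = iter(["a", "catches", "a", "red"])  # placeholder fill-ins
    return [w if w != "[MASK]" else next(words) for w in draft]

video = None  # stand-in for extracted video features
print(" ".join(refine(draft_caption(identify_objects(video)), video)))
# -> "a dog catches a red frisbee"
```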
arXiv Detail & Related papers (2021-08-05T04:17:20Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network
Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model.
Secondly, we propose a hybrid pointer network approach to explicitly interpolate the language model's output distribution with the probabilities of words occurring in the metadata.
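A minimal sketch of such an interpolation, assuming a uniform pointer distribution over metadata words and a fixed gate; the paper's pointer network presumably learns both, so the values here are purely illustrative.

```python
# Interpolate an LM's next-word distribution with a pointer distribution
# over words occurring in the video metadata. Illustrative sketch.
import torch

def interpolate(p_lm: torch.Tensor, metadata_ids: list, gate: float = 0.3):
    # p_lm: (vocab,) next-word probabilities from the language model.
    # metadata_ids: vocabulary indices of words found in the metadata.
    p_ptr = torch.zeros_like(p_lm)
    p_ptr[metadata_ids] = 1.0 / len(metadata_ids)  # uniform pointer dist.
    return (1.0 - gate) * p_lm + gate * p_ptr

p_lm = torch.softmax(torch.randn(1000), dim=0)
print(interpolate(p_lm, metadata_ids=[42, 7, 99]).sum())  # ~1.0
```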
arXiv Detail & Related papers (2020-05-15T07:47:33Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
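Teacher-recommended learning can be pictured as soft-target distillation from the ELM. The sketch below uses a standard distillation loss; the temperature and weighting are illustrative assumptions rather than the paper's exact formulation.

```python
# Train the caption model on ground-truth words (CE) plus the external
# language model's soft targets (KL). Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def trl_loss(student_logits, gt_ids, teacher_logits, alpha=0.5, tau=2.0):
    # student_logits, teacher_logits: (batch, vocab) per-step logits.
    # gt_ids: (batch,) ground-truth word indices.
    ce = F.cross_entropy(student_logits, gt_ids)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kd

loss = trl_loss(torch.randn(4, 1000), torch.randint(0, 1000, (4,)),
                torch.randn(4, 1000))
print(loss.item())
```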
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.