Bypass Network for Semantics Driven Image Paragraph Captioning
- URL: http://arxiv.org/abs/2206.10059v1
- Date: Tue, 21 Jun 2022 00:48:22 GMT
- Title: Bypass Network for Semantics Driven Image Paragraph Captioning
- Authors: Qi Zheng, Chaoyue Wang, Dadong Wang
- Abstract summary: Image paragraph captioning aims to describe a given image with a sequence of coherent sentences.
Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences.
We propose a bypass network that separately models semantics and linguistic syntax of preceding sentences.
- Score: 12.743882133781602
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image paragraph captioning aims to describe a given image with a sequence of
coherent sentences. Most existing methods model the coherence through the topic
transition that dynamically infers a topic vector from preceding sentences.
However, these methods still suffer from immediate or delayed repetitions in
generated paragraphs because (i) the entanglement of syntax and semantics
distracts the topic vector from attending to pertinent visual regions; (ii) there
are few constraints or rewards for learning long-range transitions. In this
paper, we propose a bypass network that separately models semantics and
linguistic syntax of preceding sentences. Specifically, the proposed model
consists of two main modules, i.e. a topic transition module and a sentence
generation module. The former takes previous semantic vectors as queries and
applies attention mechanism on regional features to acquire the next topic
vector, which reduces immediate repetition by eliminating linguistics. The
latter decodes the topic vector and the preceding syntax state to produce the
following sentence. To further reduce delayed repetition in generated
paragraphs, we devise a replacement-based reward for the REINFORCE training.
Comprehensive experiments on the widely used benchmark demonstrate the
superiority of the proposed model over the state of the art for coherence while
maintaining high accuracy.
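To make the two-module design concrete, below is a minimal PyTorch sketch of the separation the abstract describes: a topic transition module in which the previous sentence's semantic vector queries regional visual features, and a sentence generation module that decodes the resulting topic vector together with the preceding syntax state. The module names, dimensions, and the single-layer GRU decoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the bypass idea: semantics and syntax travel on separate paths.
# Hypothetical names and shapes; not the paper's actual implementation.
import torch
import torch.nn as nn

class TopicTransition(nn.Module):
    """Previous semantic vector queries regional features -> next topic vector."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, prev_semantics, regions):
        # prev_semantics: (B, 1, D) query from the preceding sentence's semantics
        # regions:        (B, R, D) regional visual features
        topic, _ = self.attn(prev_semantics, regions, regions)
        return topic  # (B, 1, D); no syntactic state enters this path

class SentenceGenerator(nn.Module):
    """Decodes the topic vector plus the preceding syntax state into a sentence."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.GRUCell(2 * dim, dim)   # input = [word embedding ; topic]
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, topic, syntax_state, tokens):
        # topic: (B, 1, D); syntax_state: (B, D); tokens: (B, T) teacher-forced words
        h, logits = syntax_state, []
        for t in range(tokens.size(1)):
            x = torch.cat([self.embed(tokens[:, t]), topic.squeeze(1)], dim=-1)
            h = self.cell(x, h)
            logits.append(self.out(h))
        # h is carried forward as the next sentence's preceding syntax state
        return torch.stack(logits, dim=1), h
```

The point of the split is that the topic path never sees the decoder's hidden (syntactic) state, while the generator still receives it; this is the separation the abstract credits with reducing immediate repetition.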
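The abstract also mentions a replacement-based reward used with REINFORCE to discourage delayed repetition, but gives no formula for it. The snippet below shows only a generic REINFORCE update with a greedy-decode baseline (the common self-critical setup, which is an assumption here) into which such a sentence-level reward would plug; `repetition_aware_reward` is a hypothetical placeholder, not the paper's reward.

```python
# Generic REINFORCE update with a self-critical baseline for caption training.
# `repetition_aware_reward` is a placeholder; the paper's replacement-based
# reward is not specified in the abstract.
import torch

def repetition_aware_reward(sequences):
    # Placeholder: fraction of distinct tokens per sequence, so repeated
    # tokens lower the reward. Replace with the actual paragraph-level reward.
    return torch.tensor(
        [len(set(s.tolist())) / max(len(s), 1) for s in sequences],
        dtype=torch.float,
    )

def reinforce_step(log_probs, sampled, greedy, optimizer):
    # log_probs: (B, T) log-probabilities of the sampled tokens (requires grad)
    # sampled, greedy: (B, T) sampled and greedy-decoded token ids
    r_sample = repetition_aware_reward(sampled)
    r_greedy = repetition_aware_reward(greedy)   # baseline from greedy decoding
    advantage = (r_sample - r_greedy).detach()
    loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```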
Related papers
- Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised
Sentence Embeddings [24.255946996327104]
The unsupervised sentence embedding task aims to convert sentences into semantic vector representations.
Due to token bias in pretrained language models, these models cannot capture the fine-grained semantics of sentences.
We propose a novel Self-Adaptive Reconstruction Contrastive Sentence Embeddings framework.
arXiv Detail & Related papers (2024-02-23T07:28:31Z) - On the Robustness of Text Vectorizers [9.904746542801838]
In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations.
While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes.
Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance.
arXiv Detail & Related papers (2023-03-09T16:37:37Z) - Semantic Operator Prediction and Applications [0.0]
The QDMR formalism in semantic parsing is implemented using a sequence-to-sequence model with attention, but uses only part-of-speech (POS) tags to represent the words of a sentence, keeping training as simple and fast as possible.
arXiv Detail & Related papers (2023-01-01T13:20:57Z) - Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z) - TopicNet: Semantic Graph-Guided Topic Discovery [51.71374479354178]
Existing deep hierarchical topic models are able to extract semantically meaningful topics from a text corpus in an unsupervised manner.
We introduce TopicNet as a deep hierarchical topic model that can inject prior structural knowledge as an inductive bias to influence learning.
arXiv Detail & Related papers (2021-10-27T09:07:14Z) - Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
arXiv Detail & Related papers (2020-05-05T09:02:25Z) - Multi-Step Inference for Reasoning Over Paragraphs [95.91527524872832]
Complex reasoning over text requires understanding and chaining together free-form predicates and logical connectives.
We present a compositional model reminiscent of neural module networks that can perform chained logical reasoning.
arXiv Detail & Related papers (2020-04-06T21:12:53Z)