Robust Image Captioning
- URL: http://arxiv.org/abs/2012.09732v1
- Date: Sun, 6 Dec 2020 00:33:17 GMT
- Title: Robust Image Captioning
- Authors: Daniel Yarnell, Xian Wang
- Abstract summary: In this study, we leverage Object Relations using an adversarial robust cut algorithm.
Our experimental study demonstrates the promising performance of the proposed method for image captioning.
- Score: 3.20603058999901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated captioning of photos is a task that combines the
difficulties of image analysis and text generation. One essential feature of
captioning is the concept of attention: how to determine what to describe and
in which sequence. In this study, we leverage Object Relations using an
adversarial robust cut algorithm, which builds upon this approach by explicitly
embedding knowledge about the spatial relationships in the input data through a
graph representation. Our experimental study demonstrates the promising
performance of the proposed method for image captioning.
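The abstract's idea of encoding spatial relationships between detected objects as a graph can be illustrated with a minimal sketch. The paper does not publish its data structures, so the `DetectedObject` type, the center-based relation rule, and the coarse relation labels below are all illustrative assumptions, not the authors' method:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class DetectedObject:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixel coordinates, y grows downward

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def spatial_relation(a, b):
    """Coarse relation from a to b, chosen by the dominant center offset."""
    ax, ay = center(a.box)
    bx, by = center(b.box)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "left of" if dx > 0 else "right of"
    # image coordinates: larger y means lower in the image
    return "above" if dy > 0 else "below"

def build_relation_graph(objects):
    """One (subject, relation, object) edge per object pair."""
    return [(a.label, spatial_relation(a, b), b.label)
            for a, b in combinations(objects, 2)]

objects = [
    DetectedObject("dog", (10, 50, 60, 100)),
    DetectedObject("frisbee", (80, 20, 110, 40)),
]
print(build_relation_graph(objects))  # → [('dog', 'left of', 'frisbee')]
```

A real system would feed such edges into a graph encoder so attention can follow object-to-object relations rather than raw grid features.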
Related papers
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text-descriptions.
A pairwise ranking objective is used to train this embedding space, bringing similar images, topics, and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
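The pairwise ranking objective mentioned above is commonly realized as a margin (hinge) loss over a matching and a mismatched pair. The function below is a generic sketch of that idea; the vectors, similarity function, and margin value are illustrative assumptions, not taken from the paper:

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the matching pair should outscore the mismatched
    pair by at least `margin`; zero loss once that holds."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

img = [0.6, 0.8]
good_caption = [0.7, 0.7]   # points the same way -> high similarity
bad_caption = [-0.5, 0.1]   # points elsewhere -> low similarity

print(ranking_loss(img, good_caption, bad_caption))  # → 0.0
```

Minimizing this loss over many (image, caption, distractor) triples is what shapes the shared semantic space the summary refers to.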
arXiv Detail & Related papers (2022-04-15T14:22:09Z) - Deep Learning Approaches on Image Captioning: A Review [0.5852077003870417]
Image captioning aims to generate natural language descriptions for visual content in the form of still images.
Deep learning and vision-language pre-training techniques have revolutionized the field, leading to more sophisticated methods and improved performance.
We address the challenges faced in this field by emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions.
We identify several potential future directions for research in this area, which include tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately measure caption quality.
arXiv Detail & Related papers (2022-01-31T00:39:37Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z) - Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when confronted with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Boost Image Captioning with Knowledge Reasoning [10.733743535624509]
We propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word.
We introduce a new strategy to inject external knowledge extracted from knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.
arXiv Detail & Related papers (2020-11-02T12:19:46Z) - UNISON: Unpaired Cross-lingual Image Captioning [17.60054750276632]
We present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language.
Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a parallel sentence (bitext) corpus to learn the mapping from the source to the target language in the scene-graph encoding space and to decode sentences in the target language, and (ii) a cross-modal unsupervised feature mapping, which seeks to map encoded scene-graph features from the image modality to the language modality.
arXiv Detail & Related papers (2020-10-03T06:14:06Z) - Comprehensive Image Captioning via Scene Graph Decomposition [51.660090468384375]
We address the challenging problem of image captioning by revisiting the representation of image scene graph.
At the core of our method lies the decomposition of a scene graph into a set of sub-graphs.
We design a deep model to select important sub-graphs, and to decode each selected sub-graph into a single target sentence.
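The decomposition of a scene graph into sub-graphs can be sketched with a simple grouping rule. The authors use a learned model to select and decode sub-graphs; the subject-based grouping below is only an illustrative stand-in for that decomposition step:

```python
from collections import defaultdict

def decompose(scene_graph):
    """Group (subject, predicate, object) triples by subject; each
    group is one sub-graph that could be decoded into one sentence."""
    subgraphs = defaultdict(list)
    for subj, pred, obj in scene_graph:
        subgraphs[subj].append((subj, pred, obj))
    return dict(subgraphs)

scene_graph = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "beach"),
]
for anchor, triples in decompose(scene_graph).items():
    print(anchor, "->", triples)
```

Here the "man" sub-graph could yield "A man wearing a hat is riding a horse," while the "horse" sub-graph yields a separate sentence, mirroring the one-sub-graph-per-sentence decoding the summary describes.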
arXiv Detail & Related papers (2020-07-23T00:59:21Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.