MRRC: Multiple Role Representation Crossover Interpretation for Image
Captioning With R-CNN Feature Distribution Composition (FDC)
- URL: http://arxiv.org/abs/2002.06436v1
- Date: Sat, 15 Feb 2020 19:45:22 GMT
- Title: MRRC: Multiple Role Representation Crossover Interpretation for Image
Captioning With R-CNN Feature Distribution Composition (FDC)
- Authors: Chiranjib Sur
- Abstract summary: This research provides a novel concept for context combination.
It will impact many applications that treat visual features as equivalents of descriptions of objects, activities and events.
- Score: 9.89901717499058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While image captioning by machines requires structured learning and
a basis for interpretation, improvement requires understanding and processing
multiple contexts in a meaningful way. This research provides a novel concept
for context combination and will impact many applications that treat visual
features as equivalents of descriptions of objects, activities and events.
Our architecture has three components: the Feature Distribution Composition
(FDC) Attention Layer, the Multiple Role Representation Crossover (MRRC)
Attention Layer, and the Language Decoder. The FDC Attention Layer generates
weighted attention from R-CNN features, the MRRC Attention Layer processes the
intermediate representation and generates attention for the next word, and the
Language Decoder estimates the likelihood of the next probable word in the
sentence. We demonstrate the effectiveness of FDC, MRRC, regional object
feature attention, and reinforcement learning for learning to generate better
captions from images. Our model improves on previous results by 35.3% and sets
a new standard for representation generation based on logic, better
interpretability and context.
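As a rough illustration of this three-stage pipeline, the sketch below shows one plausible PyTorch wiring of FDC attention over R-CNN region features, an MRRC crossover block, and a recurrent language decoder that scores the next word. All module structures, dimensions, and names here are our own assumptions for illustration; the paper does not publish this code.

```python
# Hypothetical sketch of the FDC -> MRRC -> Language Decoder pipeline.
# Shapes, the role-mixing scheme, and the LSTM decoder are assumptions,
# not the paper's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FDCAttention(nn.Module):
    """Feature Distribution Composition: weighted attention over R-CNN regions."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)     # project region features
        self.query = nn.Linear(hidden_dim, hidden_dim)  # condition on decoder state
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, state):
        # regions: (B, R, feat_dim); state: (B, hidden_dim)
        keys = self.proj(regions)                                       # (B, R, H)
        logits = self.score(torch.tanh(keys + self.query(state).unsqueeze(1)))
        weights = F.softmax(logits, dim=1)                              # attention over regions
        return (weights * keys).sum(dim=1)                              # composed feature (B, H)

class MRRCAttention(nn.Module):
    """Multiple Role Representation Crossover: mixes learned 'role' views
    of the visual context into one intermediate representation."""
    def __init__(self, hidden_dim, num_roles=4):
        super().__init__()
        self.roles = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim)
                                    for _ in range(num_roles)])
        self.gate = nn.Linear(hidden_dim, num_roles)

    def forward(self, context):
        views = torch.stack([r(context) for r in self.roles], dim=1)    # (B, K, H)
        mix = F.softmax(self.gate(context), dim=-1).unsqueeze(-1)       # (B, K, 1)
        return (mix * views).sum(dim=1)                                 # crossover (B, H)

class CaptionDecoder(nn.Module):
    """Language decoder: estimates the likelihood of the next word."""
    def __init__(self, vocab_size, feat_dim, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.fdc = FDCAttention(feat_dim, hidden_dim)
        self.mrrc = MRRCAttention(hidden_dim)
        self.rnn = nn.LSTMCell(2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, regions, h, c):
        visual = self.mrrc(self.fdc(regions, h))         # FDC -> MRRC
        h, c = self.rnn(torch.cat([self.embed(word_ids), visual], dim=-1), (h, c))
        return F.log_softmax(self.out(h), dim=-1), h, c  # next-word log-likelihood
```

Under this reading, training would begin with cross-entropy on the per-step log-likelihoods and could then fine-tune with a REINFORCE-style caption reward, consistent with the abstract's mention of reinforcement learning; the actual reward and schedule are not specified here.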
Related papers
- Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns image-to-language and language-to-image context by reconstructing missing features in both the image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Hierarchical Aligned Multimodal Learning for NER on Tweet Posts [12.632808712127291]
Multimodal named entity recognition (MNER) has attracted increasing attention.
We propose a novel approach that dynamically aligns the image and text sequences.
We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
arXiv Detail & Related papers (2023-05-15T06:14:36Z) - Stacked Cross-modal Feature Consolidation Attention Networks for Image
Captioning [1.4337588659482516]
This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information.
We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features.
Our proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K datasets.
arXiv Detail & Related papers (2023-02-08T09:15:09Z) - CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that enforces event comprehension in vision-language pretraining models.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task that aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)