MRRC: Multiple Role Representation Crossover Interpretation for Image
Captioning With R-CNN Feature Distribution Composition (FDC)
- URL: http://arxiv.org/abs/2002.06436v1
- Date: Sat, 15 Feb 2020 19:45:22 GMT
- Title: MRRC: Multiple Role Representation Crossover Interpretation for Image
Captioning With R-CNN Feature Distribution Composition (FDC)
- Authors: Chiranjib Sur
- Abstract summary: This research provides a novel concept for context combination.
It will impact many applications that treat visual features as equivalents of descriptions of objects, activities and events.
- Score: 9.89901717499058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While image captioning by machines requires structured learning and a
basis for interpretation, improving it requires understanding and processing
multiple contexts in a meaningful way. This research provides a novel concept
for context combination and will impact many applications that treat visual
features as equivalents of descriptions of objects, activities and events.
Our architecture has three components: Feature Distribution Composition (FDC)
Layer Attention, the Multiple Role Representation Crossover (MRRC) Attention
Layer, and the Language Decoder. FDC Layer Attention generates weighted
attention from R-CNN features, the MRRC Attention Layer processes the
intermediate representation and generates the attention for the next word, and
the Language Decoder estimates the likelihood of the next probable word in the
sentence. We demonstrate the effectiveness of FDC, MRRC, regional object
feature attention and reinforcement learning for generating better captions
from images. Our model improves on previous results by 35.3% and sets a new
standard for representation generation based on logic, better interpretability
and context.
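The abstract describes the pipeline only at a high level, so the following is a
minimal PyTorch sketch of how such a three-stage pipeline could be wired: an
FDC-style attention weights R-CNN region features against the decoder state, an
MRRC-style module crosses several "role" projections of that context, and an
LSTM decoder scores the next word. The class names, layer sizes, number of
roles, and the choice of an LSTM are assumptions for illustration, not the
authors' specification.

```python
# Minimal sketch, assuming an LSTM decoder and illustrative layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FDCAttention(nn.Module):
    """Weights R-CNN region features by their relevance to the decoder state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, state):
        # regions: (B, R, feat_dim) R-CNN features; state: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.state_proj(state).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)               # (B, R, 1) attention weights
        return (alpha * regions).sum(dim=1)       # (B, feat_dim) weighted context


class MRRCAttention(nn.Module):
    """Crosses multiple 'role' projections of the visual context (assumed form)."""

    def __init__(self, feat_dim: int, hidden_dim: int, n_roles: int = 4):
        super().__init__()
        self.roles = nn.ModuleList(
            [nn.Linear(feat_dim, hidden_dim) for _ in range(n_roles)])
        self.mix = nn.Linear(hidden_dim, 1)

    def forward(self, context):
        # context: (B, feat_dim) from FDC attention
        role_feats = torch.stack(
            [torch.tanh(r(context)) for r in self.roles], dim=1)  # (B, K, H)
        w = F.softmax(self.mix(role_feats), dim=1)                # (B, K, 1)
        return (w * role_feats).sum(dim=1)                        # (B, H)


class CaptionDecoder(nn.Module):
    """LSTM language decoder estimating the likelihood of the next word."""

    def __init__(self, vocab_size: int, feat_dim: int,
                 hidden_dim: int = 512, emb_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fdc = FDCAttention(feat_dim, hidden_dim)
        self.mrrc = MRRCAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, regions, h, c):
        # prev_word: (B,) token ids; h, c: (B, hidden_dim), e.g. zeros at t=0
        ctx = self.mrrc(self.fdc(regions, h))     # region context after role crossover
        h, c = self.lstm(torch.cat([self.embed(prev_word), ctx], dim=-1), (h, c))
        return F.log_softmax(self.out(h), dim=-1), h, c   # next-word log-likelihood
```

The reinforcement-learning component mentioned in the abstract would typically
be layered on top of such a decoder, for example by fine-tuning with a
sequence-level caption-quality reward; the paper's exact setup is not specified
here.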
Related papers
- Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning [86.99944014645322]
We introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning.
We decompose each query image into its high-frequency and low-frequency components, and incorporate them into the feature embedding network in parallel.
Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.
arXiv Detail & Related papers (2024-11-03T04:02:35Z) - HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
arXiv Detail & Related papers (2024-11-02T05:00:13Z) - Hierarchical Aligned Multimodal Learning for NER on Tweet Posts [12.632808712127291]
Multimodal named entity recognition (MNER) has attracted increasing attention.
We propose a novel approach that dynamically aligns the image and text sequence.
We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
arXiv Detail & Related papers (2023-05-15T06:14:36Z) - Stacked Cross-modal Feature Consolidation Attention Networks for Image
Captioning [1.4337588659482516]
This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information.
We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features.
Our proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K datasets.
arXiv Detail & Related papers (2023-02-08T09:15:09Z) - CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument roles.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.