ViTOC: Vision Transformer and Object-aware Captioner
- URL: http://arxiv.org/abs/2411.07265v2
- Date: Wed, 13 Nov 2024 07:26:33 GMT
- Title: ViTOC: Vision Transformer and Object-aware Captioner
- Authors: Feiyang Huang
- Abstract summary: ViTOC is a vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions.
By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
- Abstract: This paper presents ViTOC (Vision Transformer and Object-aware Captioner), a novel vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions. Unlike conventional approaches, ViTOC employs a dual-path architecture based on Vision Transformer and object detector, effectively fusing global visual features and local object information through learnable vectors. The model introduces an innovative object-aware prompting strategy that significantly enhances its capability in handling long-tail data. Experiments on the standard COCO dataset demonstrate that ViTOC outperforms baseline models across all evaluation metrics. Additionally, we propose a reference-free evaluation method based on CLIP to further validate the model's effectiveness. By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
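The abstract describes fusing global ViT features and local object-detector information through learnable vectors to form an object-aware prompt for the caption decoder. The paper's actual dimensions, pooling mechanism, and prompt format are not given here, so the following is only a minimal NumPy sketch of one plausible reading: shared learnable query vectors pool each path via cross-attention, and the pooled results are concatenated into a prompt prefix. All shapes and names (`fusion_queries`, `cross_attend`, the token counts) are illustrative assumptions, not ViTOC's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature shapes; ViTOC's real dimensions are not stated in the abstract.
D = 64                                   # shared embedding width
vit_tokens = rng.normal(size=(197, D))   # global path: CLS + 196 patch tokens from a ViT
obj_embeds = rng.normal(size=(5, D))     # local path: embeddings of detected object labels

# Learnable fusion vectors (randomly initialised here) that pool each path
# into a fixed-length prefix for the caption decoder.
fusion_queries = rng.normal(size=(8, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys):
    """Single-head cross-attention: each query pools information from the keys."""
    attn = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))
    return attn @ keys

# Fuse: the shared queries pool each path, and the results are concatenated
# into an object-aware prompt prefix for the language decoder.
global_part = cross_attend(fusion_queries, vit_tokens)              # (8, D)
object_part = cross_attend(fusion_queries, obj_embeds)              # (8, D)
prompt_prefix = np.concatenate([global_part, object_part], axis=0)  # (16, D)
print(prompt_prefix.shape)
```

Because both paths are reduced to a fixed-length prefix, the decoder's input size stays constant regardless of how many objects the detector finds, which is one way such a fusion could keep end-to-end training efficient on frozen pretrained encoders.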
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides elaborate high-level semantic explanations with great localization performance only with the class labels.
arXiv Detail & Related papers (2024-02-07T03:43:56Z)
- ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer [0.0]
Vision Transformers (ViT) have demonstrated superior performance in various computer vision tasks such as image classification and object detection.
Current state-of-the-art solutions for ViT rely on class Attention-Rollout and Relevance techniques.
We propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which does not require attention matrix and gradient information.
arXiv Detail & Related papers (2023-10-04T05:09:50Z)
- FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning [9.950367271170592]
We investigate where and how to partially personalize a Vision Transformers (ViT) model.
Based on the insights that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix.
We evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness compared to several advanced PFL methods.
arXiv Detail & Related papers (2023-08-17T19:22:30Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) of convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision Transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)
- Learning View and Target Invariant Visual Servoing for Navigation [9.873635079670093]
We learn viewpoint invariant and target invariant visual servoing for local mobile robot navigation.
We train a deep convolutional network controller to reach the desired goal.
arXiv Detail & Related papers (2020-03-04T20:36:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.