Cosine meets Softmax: A tough-to-beat baseline for visual grounding
- URL: http://arxiv.org/abs/2009.06066v1
- Date: Sun, 13 Sep 2020 19:35:43 GMT
- Title: Cosine meets Softmax: A tough-to-beat baseline for visual grounding
- Authors: Nivedita Rufus, Unni Krishnan R Nair, K. Madhava Krishna and Vineet Gandhi
- Abstract summary: Our framework minimizes the cross-entropy loss over the cosine distances between multiple image ROI features and a text embedding.
We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%.
- Score: 17.316608734530124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a simple baseline for visual grounding for
autonomous driving which outperforms state-of-the-art methods while retaining
minimal design choices. Our framework minimizes the cross-entropy loss over the
cosine distances between multiple image ROI features and a text embedding
(representing the given sentence/phrase). We use pre-trained networks for
obtaining the initial embeddings and learn a transformation layer on top of the
text embedding. We perform experiments on the Talk2Car dataset and achieve
68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By
demonstrating the promise of this simpler alternative, our investigation
suggests reconsidering approaches built on sophisticated attention mechanisms,
multi-stage reasoning, or complex metric-learning loss functions.
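As a rough sketch of the recipe described above, the following PyTorch snippet scores each image ROI against a projected text embedding with cosine similarity and trains with cross-entropy over the softmax of those scores. The feature dimensions, the single linear projection, and the variable names are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSoftmaxGrounder(nn.Module):
    """Minimal sketch of a cosine-similarity + softmax grounding baseline."""

    def __init__(self, text_dim=768, roi_dim=2048):
        super().__init__()
        # Learned transformation on top of the (frozen, pre-trained) text
        # embedding, mapping it into the ROI feature space.
        self.text_proj = nn.Linear(text_dim, roi_dim)

    def forward(self, roi_feats, text_emb):
        # roi_feats: (batch, num_rois, roi_dim) from a pre-trained detector
        # text_emb:  (batch, text_dim) from a pre-trained language model
        t = self.text_proj(text_emb)                         # (batch, roi_dim)
        return F.cosine_similarity(roi_feats, t.unsqueeze(1), dim=-1)

# Training step: cross-entropy over the per-ROI cosine scores, where
# `target` indexes the ground-truth ROI for each sentence.
model = CosineSoftmaxGrounder()
roi_feats = torch.randn(4, 32, 2048)
text_emb = torch.randn(4, 768)
target = torch.randint(0, 32, (4,))
loss = F.cross_entropy(model(roi_feats, text_emb), target)
```

At inference, the predicted region would simply be the argmax ROI, which keeps the design as minimal as the abstract suggests.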
Related papers
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding.
Our approach consists of two stages: first, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to the corresponding visual elements.
On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning that relies solely on text-based CoT.
arXiv Detail & Related papers (2025-05-26T08:54:14Z)
- Towards Cross-View-Consistent Self-Supervised Surround Depth Estimation [9.569646683579899]
Self-Supervised Surround Depth Estimation (SSSDE) from consecutive images offers an economical alternative.
Previous SSSDE methods have proposed different mechanisms to fuse information across images, but few of them explicitly consider the cross-view constraints.
This paper proposes an efficient and consistent pose estimation design and two loss functions to enhance cross-view consistency for SSSDE.
arXiv Detail & Related papers (2024-07-04T16:29:05Z)
- Is Cross-modal Information Retrieval Possible without Training? [4.616703548353372]
We take a simple mapping computed from least squares and the singular value decomposition (SVD) as a solution to the Procrustes problem.
That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image.
Using off-the-shelf pretrained deep learning models, we have experimented with the aforementioned simple cross-modal mappings in tasks of text-to-image and image-to-text retrieval.
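A minimal sketch of such a training-free mapping, assuming paired, equal-dimension text embeddings X and image embeddings Y (the names, shapes, and retrieval step are illustrative, not the paper's exact setup):

```python
import numpy as np

def procrustes_map(X, Y):
    # Orthogonal Procrustes: the rotation W minimizing ||X @ W - Y||_F
    # is W = U @ Vt, where U, S, Vt is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def least_squares_map(X, Y):
    # Unconstrained linear alternative, solved by ordinary least squares.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Retrieval: map a text query into image space, then rank the gallery
# by similarity -- no training of the underlying encoders involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))    # stand-in text embeddings
Y = rng.normal(size=(100, 64))    # stand-in paired image embeddings
W = procrustes_map(X, Y)
query = X[0] @ W                  # text embedding mapped to image space
scores = Y @ query                # dot-product similarity to the gallery
```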
arXiv Detail & Related papers (2023-04-20T02:36:18Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800x faster inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy to perform the task of learning a joint visual-semantic embedding.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
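The paper proposes two specific variants; as a generic point of reference, here is a standard normalized-cross-entropy (InfoNCE-style) contrastive loss over paired image/text embeddings. The temperature value and the symmetric two-direction formulation are assumptions, not necessarily the authors' losses:

```python
import torch
import torch.nn.functional as F

def nce_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Cosine-similarity logits between all image/text pairs in the batch;
    # matched pairs sit on the diagonal and act as the positives.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (batch, batch)
    target = torch.arange(logits.size(0))
    # Symmetric normalized cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.t(), target)) / 2

loss = nce_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```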
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
- Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
- Recurrent Multi-view Alignment Network for Unsupervised Surface Registration [79.72086524370819]
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
We propose to represent the non-rigid transformation with a point-wise combination of several rigid transformations.
We also introduce a differentiable loss function that measures the 3D shape similarity on the projected multi-view 2D depth images.
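As a sketch of that point-wise representation (the number of rigid transforms K, the softmax weighting, and all shapes are assumptions for illustration):

```python
import torch

def blend_rigid_transforms(points, R, t, weights):
    # points:  (n, 3) source points
    # R:       (K, 3, 3) rotation matrices
    # t:       (K, 3) translations
    # weights: (n, K) per-point blend weights (each row sums to 1)
    # Apply every rigid transform to every point -> (K, n, 3)
    moved = torch.einsum('kij,nj->kni', R, points) + t[:, None, :]
    # Blend the K candidates per point with the soft weights -> (n, 3)
    return torch.einsum('nk,kni->ni', weights, moved)

n, K = 500, 8
points = torch.randn(n, 3)
R = torch.eye(3).expand(K, 3, 3).clone()           # placeholder rotations
t = torch.randn(K, 3)
weights = torch.softmax(torch.randn(n, K), dim=-1)
warped = blend_rigid_transforms(points, R, t, weights)
```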
arXiv Detail & Related papers (2020-11-24T14:22:42Z)
- Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection [63.18846475183332]
We aim to develop an efficient and compact deep network for RGB-D salient object detection.
We propose a progressively guided alternate refinement network to refine the coarse initial prediction.
Our model outperforms existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2020-08-17T02:55:06Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances between the frames within the ground-truth segment and the starting (ending) frame as dense supervision to improve the video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
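A small sketch of how such dense per-frame regression targets could be built for one video segment; the signed-distance convention and the inside-segment mask are assumptions about the setup, not DRN's exact target definition:

```python
import torch

def dense_distance_targets(num_frames, start, end):
    # Per-frame distances to the segment's starting and ending frames;
    # only frames inside the ground-truth segment are supervised.
    idx = torch.arange(num_frames, dtype=torch.float32)
    d_start = idx - start                    # distance to the starting frame
    d_end = end - idx                        # distance to the ending frame
    mask = (idx >= start) & (idx <= end)     # frames inside the segment
    return torch.stack([d_start, d_end], dim=-1), mask

targets, mask = dense_distance_targets(100, start=20, end=45)
```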
arXiv Detail & Related papers (2020-04-07T17:15:37Z)
- Scan-based Semantic Segmentation of LiDAR Point Clouds: An Experimental Study [2.6205925938720833]
State-of-the-art methods use deep neural networks to predict semantic classes for each point in a LiDAR scan.
A powerful and efficient way to process LiDAR measurements is to use two-dimensional, image-like projections.
We demonstrate various techniques to boost the performance and to improve runtime as well as memory constraints.
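One widely used instance of such an image-like projection is the spherical range image; here is a minimal sketch, where the field-of-view values and resolution are assumptions typical of a 64-beam sensor, not necessarily the study's settings:

```python
import numpy as np

def range_image(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    # Project an (n, 3) LiDAR scan onto an (H, W) grid: azimuth maps to
    # columns, elevation to rows; empty pixels are left at -1.
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * W), 0, W - 1).astype(int)
    v = np.clip(np.floor((1.0 - (pitch - fov_down) / fov) * H), 0, H - 1).astype(int)
    img = np.full((H, W), -1.0)
    img[v, u] = r                 # keep the range of the last point per pixel
    return img

scan = np.random.randn(10000, 3) * 10     # stand-in for a real LiDAR scan
img = range_image(scan)
```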
arXiv Detail & Related papers (2020-04-06T11:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantees about the quality of the information presented and is not responsible for any consequences of its use.