OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval
- URL: http://arxiv.org/abs/2204.08209v1
- Date: Mon, 18 Apr 2022 08:15:38 GMT
- Title: OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval
- Authors: Yunhao Du, Binyu Zhang, Xiangning Ruan, Fei Su, Zhicheng Zhao and Hong Chen
- Abstract summary: We propose a novel framework for the natural language-based vehicle retrieval task, which Observes Multiple Granularities.
Our OMG significantly outperforms all previous methods and ranks 9th on Track 2 of the 6th AI City Challenge.
- Score: 33.15778584483565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieving tracked vehicles by natural language descriptions plays a critical
role in smart city construction. It aims to find the best match for the given
texts from a set of tracked vehicles in surveillance videos. Existing works
generally solve it by a dual-stream framework, which consists of a text
encoder, a visual encoder and a cross-modal loss function. Although some
progress has been made, these methods fail to fully exploit the information at various
levels of granularity. To tackle this issue, we propose a novel framework for
the natural language-based vehicle retrieval task, OMG, which Observes Multiple
Granularities with respect to visual representation, textual representation and
objective functions. For the visual representation, target features, context
features and motion features are encoded separately. For the textual
representation, one global embedding, three local embeddings and a color-type
prompt embedding are extracted to represent various granularities of semantic
features. Finally, the overall framework is optimized by a cross-modal
multi-granularity contrastive loss function. Experiments demonstrate the
effectiveness of our method. Our OMG significantly outperforms all previous
methods and ranks 9th on Track 2 of the 6th AI City Challenge. The code is
available at https://github.com/dyhBUPT/OMG.
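To make the multi-granularity design described in the abstract concrete, below is a minimal PyTorch sketch of how separately encoded visual granularities (target, context, motion) and textual granularities (global, local, color-type prompt) could be matched with a cross-modal multi-granularity contrastive loss. It is reconstructed from the abstract only: the shared embedding dimension, the all-pairs granularity matching, and the symmetric InfoNCE form with a fixed temperature are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Sketch of a dual-stream, multi-granularity matching head (assumptions noted above).
import torch
import torch.nn.functional as F

def info_nce(v, t, temperature=0.07):
    """Symmetric contrastive loss between one visual and one textual granularity.

    v, t: (batch, dim) L2-normalised embeddings of matched vehicle/text pairs.
    """
    logits = v @ t.T / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +  # vision -> text
                  F.cross_entropy(logits.T, labels)) # text -> vision

def multi_granularity_loss(visual_feats, text_feats):
    """Sum pairwise contrastive losses over all granularity combinations (an assumption)."""
    total = 0.0
    for v in visual_feats.values():
        for t in text_feats.values():
            total = total + info_nce(F.normalize(v, dim=-1),
                                     F.normalize(t, dim=-1))
    return total

# Toy usage with random stand-ins for the separately encoded granularities.
B, D = 8, 256
visual = {k: torch.randn(B, D) for k in ('target', 'context', 'motion')}
text = {k: torch.randn(B, D) for k in ('global', 'local_1', 'local_2',
                                       'local_3', 'color_type_prompt')}
loss = multi_granularity_loss(visual, text)
```

Summing the pairwise terms is one simple way to optimize all granularity combinations jointly; the paper's actual pairing and weighting may differ.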
Related papers
- Visual Grounding with Multi-modal Conditional Adaptation [14.177510695317098]
Visual grounding is the task of locating objects specified by natural language expressions.
We introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights.
MMCA achieves significant improvements and state-of-the-art results.
arXiv Detail & Related papers (2024-09-08T07:08:58Z)
- Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding [26.768147543628096]
We propose a novel framework that emphasizes object and context comprehension inspired by human cognitive processes.
Our method achieves significant performance improvements on three benchmark datasets.
arXiv Detail & Related papers (2024-04-12T16:38:48Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU scores on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Learning Granularity-Unified Representations for Text-to-Image Person Re-identification [29.04254233799353]
Text-to-image person re-identification (ReID) aims to search for pedestrian images of an interested identity via textual descriptions.
Existing works usually ignore the difference in feature granularity between the two modalities.
We propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR.
arXiv Detail & Related papers (2022-07-16T01:26:10Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard visual representation as pluggable visual prefix to guide the textual representation for error insensitive forecasting decision.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers [0.981213663876059]
We present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language.
The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, (ii) a convolutional backbone along with a Transformer model to embed the visual information.
For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings (the standard form of this loss is sketched after this list).
arXiv Detail & Related papers (2021-06-18T14:38:51Z)
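The AYCE entry above mentions a variation of the Triplet Margin Loss for aligning visual and language embeddings. The exact variation is not described in the summary, so the sketch below shows only the standard PyTorch triplet loss applied to a BERT-style text anchor with visual positives and negatives; the margin value and embedding size are illustrative assumptions.

```python
# Baseline triplet margin loss between visual and language embeddings.
# This is only the standard form that the AYCE "variation" would build on;
# the specific modification is not stated in the summary above.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3, p=2)  # margin chosen arbitrarily here

B, D = 8, 256
text_anchor = torch.randn(B, D)  # e.g. BERT embedding of the description
visual_pos  = torch.randn(B, D)  # embedding of the matching vehicle track
visual_neg  = torch.randn(B, D)  # embedding of a non-matching track

loss = triplet(text_anchor, visual_pos, visual_neg)
```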