All You Can Embed: Natural Language based Vehicle Retrieval with
Spatio-Temporal Transformers
- URL: http://arxiv.org/abs/2106.10153v1
- Date: Fri, 18 Jun 2021 14:38:51 GMT
- Title: All You Can Embed: Natural Language based Vehicle Retrieval with
Spatio-Temporal Transformers
- Authors: Carmelo Scribano, Davide Sapienza, Giorgia Franchini, Micaela Verucchi
and Marko Bertogna
- Abstract summary: We present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language.
The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, and (ii) a convolutional backbone along with a Transformer model to embed the visual information.
For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings.
- Score: 0.981213663876059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combining Natural Language with Vision represents a unique and interesting
challenge in the domain of Artificial Intelligence. The AI City Challenge Track
5 for Natural Language-Based Vehicle Retrieval focuses on the problem of
combining visual and textual information, applied to a smart-city use case. In
this paper, we present All You Can Embed (AYCE), a modular solution to
correlate single-vehicle tracking sequences with natural language. The main
building blocks of the proposed architecture are (i) BERT to provide an
embedding of the textual descriptions, and (ii) a convolutional backbone along with
a Transformer model to embed the visual information. For the training of the
retrieval model, a variation of the Triplet Margin Loss is proposed to learn a
distance measure between the visual and language embeddings. The code is
publicly available at https://github.com/cscribano/AYCE_2021.
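As a rough illustration of the pipeline the abstract describes, the sketch below pairs a BERT text encoder with a CNN-plus-Transformer visual encoder and aligns the two with a triplet margin loss. This is a minimal PyTorch-style sketch, not the released implementation: the ResNet-50 backbone, projection size, pooling strategy and margin value are assumptions, and the standard nn.TripletMarginLoss stands in for the paper's modified triplet loss (the authors' actual code is at the GitHub link above).

```python
# Minimal sketch (not the authors' code) of an AYCE-style dual encoder:
# BERT embeds the description, a CNN backbone plus a Transformer encoder
# embeds the tracking sequence, and a triplet margin loss aligns the two
# spaces. Backbone choice, dimensions and margin are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from transformers import BertModel


class TextEncoder(nn.Module):
    """BERT [CLS] embedding projected into the shared retrieval space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [CLS] token
        return F.normalize(self.proj(cls), dim=-1)


class VisualEncoder(nn.Module):
    """Per-frame CNN features aggregated over time by a Transformer encoder."""
    def __init__(self, embed_dim=256, num_layers=2, nhead=8):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()                  # keep 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*T, 2048)
        feats = self.proj(feats).view(b, t, -1)      # (B, T, D)
        feats = self.temporal(feats)                 # temporal mixing
        return F.normalize(feats.mean(dim=1), dim=-1)


# Cross-modal triplet margin loss: the text embedding is the anchor, the
# matching track is the positive, and another track serves as the negative.
triplet_loss = nn.TripletMarginLoss(margin=0.2)

def retrieval_loss(text_emb, pos_track_emb, neg_track_emb):
    return triplet_loss(text_emb, pos_track_emb, neg_track_emb)
```

A simple choice for the negatives, presumably the easiest in practice, is to take non-matching tracks from the same batch; the exact variation of the loss used by the authors is documented in their repository.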
Related papers
- Language Prompt for Autonomous Driving [58.45334918772529]
We propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt.
It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks.
Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames.
arXiv Detail & Related papers (2023-09-08T15:21:07Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- OMG: Observe Multiple Granularities for Natural Language-Based Vehicle
Retrieval [33.15778584483565]
We propose a novel framework for the natural language-based vehicle retrieval task, which Observes Multiple Granularities.
Our OMG significantly outperforms all previous methods and ranks 9th on the 6th AI City Challenge Track 2.
arXiv Detail & Related papers (2022-04-18T08:15:38Z)
- Open-Vocabulary DETR with Conditional Matching [86.1530128487077]
OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art.
arXiv Detail & Related papers (2022-03-22T16:54:52Z)
- Language Model-Based Paired Variational Autoencoders for Robotic Language Learning [18.851256771007748]
Similar to human infants, artificial agents can learn language while interacting with their environment.
We present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario.
Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model.
arXiv Detail & Related papers (2022-01-17T10:05:26Z)
- Connecting Language and Vision for Natural Language-Based Vehicle
Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model.
Our proposed method achieved 1st place on the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
- SBNet: Segmentation-based Network for Natural Language-based Vehicle
Search [8.286899656309476]
Natural language-based vehicle retrieval is the task of finding a target vehicle within a given image based on a natural language description used as a query.
This technology can be applied to various areas, including police searches for a suspect vehicle.
We propose a deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval.
arXiv Detail & Related papers (2021-04-22T08:06:17Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language
Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout
Transformer [0.6702423358056857]
We introduce the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics.
We trained our network on real-world documents with different layouts, such as tables, figures, and forms.
arXiv Detail & Related papers (2021-02-18T18:51:47Z)