SBNet: Segmentation-based Network for Natural Language-based Vehicle Search
- URL: http://arxiv.org/abs/2104.11589v1
- Date: Thu, 22 Apr 2021 08:06:17 GMT
- Title: SBNet: Segmentation-based Network for Natural Language-based Vehicle Search
- Authors: Sangrok Lee, Taekang Woo, Sang Hun Lee
- Abstract summary: Natural language-based vehicle retrieval is the task of finding a target vehicle within a given image using a natural language description as the query.
This technology can be applied to various areas, including police searches for a suspect vehicle.
We propose a deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval.
- Score: 8.286899656309476
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Natural language-based vehicle retrieval is the task of finding a target vehicle within a given image using a natural language description as the query. This technology can be applied to various areas, including police searches for a suspect vehicle. However, it is challenging due to the ambiguity of language descriptions and the difficulty of processing multi-modal data. To tackle this problem, we propose a deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval. We also propose two task-specific modules to improve performance: a substitution module that helps features from different domains to be embedded in the same space, and a future prediction module that learns temporal information. SBNet was trained on the CityFlow-NL dataset, which contains 2,498 vehicle tracks with three unique natural language descriptions each, and tested on 530 unique vehicle tracks and their corresponding query sets. SBNet achieved a significant improvement over the baseline in the natural language-based vehicle retrieval track of the AI City Challenge 2021.
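The summary names the substitution module only at a high level: it maps features from the text and image domains into one shared space so they can be compared directly. As a rough illustration of that idea (not SBNet's actual architecture; all names and dimensions below are hypothetical), such a cross-modal projection can look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjection(nn.Module):
    """Hypothetical sketch: project text and image features into one
    joint embedding space so cross-modal similarity is well-defined.
    Dimensions are illustrative, not taken from the SBNet paper."""

    def __init__(self, text_dim=768, image_dim=2048, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)    # e.g. BERT-style text features
        self.image_proj = nn.Linear(image_dim, joint_dim)  # e.g. CNN visual features

    def forward(self, text_feat, image_feat):
        # L2-normalize so the dot product below is cosine similarity.
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.image_proj(image_feat), dim=-1)
        return t, v

model = SharedSpaceProjection()
text_feat = torch.randn(4, 768)    # a batch of query embeddings
image_feat = torch.randn(4, 2048)  # a batch of vehicle-crop embeddings
t, v = model(text_feat, image_feat)
similarity = t @ v.T               # pairwise query-vehicle scores
```

Once both modalities live in the same normalized space, retrieval reduces to ranking vehicle tracks by cosine similarity to the query.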
Related papers
- MENTOR: Multilingual tExt detectioN TOward leaRning by analogy [59.37382045577384]
We propose a framework to detect and identify both seen and unseen language regions inside scene images.
"MENTOR" is the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
arXiv Detail & Related papers (2024-03-12T03:35:17Z)
- Language Prompt for Autonomous Driving [58.45334918772529]
We propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt.
It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks.
Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames.
arXiv Detail & Related papers (2023-09-08T15:21:07Z)
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, and we expect it to present a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
- FindVehicle and VehicleFinder: A NER dataset for natural language-based vehicle retrieval and a keyword-based cross-modal vehicle retrieval system [7.078561467480664]
Natural language (NL) based vehicle retrieval is the task of retrieving, from all candidate vehicles, the one most consistent with a given NL query.
To tackle these problems and simplify the task, we borrow the idea of named entity recognition (NER) and construct FindVehicle, a NER dataset in the traffic domain.
VehicleFinder achieves 87.7% precision and 89.4% recall when retrieving a target vehicle by text command on our homemade dataset.
arXiv Detail & Related papers (2023-04-21T11:20:23Z)
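To make the NER framing above concrete: a vehicle query is treated as carrying taggable entities such as color, type, and maneuver. The sketch below uses a hypothetical tag set and vocabulary (FindVehicle's real annotation scheme is richer) to show the extraction step:

```python
import re

# Hypothetical entity vocabularies; FindVehicle's actual tag set is richer.
VOCAB = {
    "COLOR": ["red", "blue", "black", "white", "gray", "silver"],
    "TYPE": ["sedan", "suv", "truck", "pickup", "van", "bus"],
    "MANEUVER": ["turning left", "turning right", "going straight", "stopping"],
}

def extract_entities(query: str) -> list[tuple[str, str]]:
    """Return (tag, surface form) pairs found in a natural-language query."""
    entities = []
    lowered = query.lower()
    for tag, words in VOCAB.items():
        for word in words:
            # Word-boundary match so "red" does not fire inside "covered".
            if re.search(rf"\b{re.escape(word)}\b", lowered):
                entities.append((tag, word))
    return entities

print(extract_entities("A red pickup truck turning right at the intersection."))
# [('COLOR', 'red'), ('TYPE', 'truck'), ('TYPE', 'pickup'), ('MANEUVER', 'turning right')]
```

The extracted tags can then serve as keywords for the cross-modal matching stage, which is the "keyword-based" part of VehicleFinder's pipeline.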
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Symmetric Network with Spatial Relationship Modeling for Natural Language-based Vehicle Retrieval [3.610372087454382]
Natural language (NL) based vehicle retrieval aims to search for a specific vehicle given a text description.
We propose a Symmetric Network with Spatial Relationship Modeling (SSM) method for NL-based vehicle retrieval.
We achieve 43.92% MRR accuracy on the test set of the natural language-based vehicle retrieval track of the 6th AI City Challenge.
arXiv Detail & Related papers (2022-06-22T07:02:04Z)
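The 43.92% figure is mean reciprocal rank (MRR), the standard metric on this retrieval track (the "Connecting Language and Vision" entry below reports 18.69% on an earlier edition): average, over all queries, the reciprocal of the rank at which the correct track appears. A minimal reference implementation:

```python
def mean_reciprocal_rank(rankings: list[list[str]], ground_truth: list[str]) -> float:
    """MRR over a set of queries.

    rankings[i]     -- candidate track IDs for query i, best first
    ground_truth[i] -- the correct track ID for query i
    A query whose target never appears contributes 0.
    """
    total = 0.0
    for ranked_ids, target in zip(rankings, ground_truth):
        if target in ranked_ids:
            total += 1.0 / (ranked_ids.index(target) + 1)  # ranks are 1-based
    return total / len(rankings)

# Toy example: targets ranked 1st and 2nd -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["b", "a"]], ["a", "a"]))
```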
- All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers [0.981213663876059]
We present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language.
The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, and (ii) a convolutional backbone along with a Transformer model to embed the visual information.
For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings.
arXiv Detail & Related papers (2021-06-18T14:38:51Z)
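The AYCE summary says only that a variation of the Triplet Margin Loss is used; its exact form is not given. For reference, the standard triplet margin loss it varies pulls a matching text-track pair together and pushes a non-matching track at least a margin away, e.g. in PyTorch:

```python
import torch
import torch.nn as nn

# Standard triplet margin loss; AYCE proposes a variation of this,
# whose details are not given in the summary above.
loss_fn = nn.TripletMarginLoss(margin=1.0)

dim = 256
anchor = torch.randn(8, dim)    # language embeddings (queries)
positive = torch.randn(8, dim)  # visual embeddings of the matching tracks
negative = torch.randn(8, dim)  # visual embeddings of non-matching tracks

# Minimized when each anchor is at least `margin` closer
# to its positive than to its negative.
loss = loss_fn(anchor, positive, negative)
print(loss.item())
```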
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model.
Our proposed method achieved 1st place on the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
- Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z)
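The LGN summary does not say how the sentence embedding guides the visual pipeline. One common mechanism for this kind of conditioning, shown purely as an assumption-laden sketch rather than LGN's actual design, is FiLM-style channel-wise modulation of the video features:

```python
import torch
import torch.nn as nn

class LanguageGuidedModulation(nn.Module):
    """FiLM-style conditioning: the sentence embedding predicts a
    per-channel scale and shift for the video features. This is one
    common guidance mechanism, not necessarily the one used by LGN."""

    def __init__(self, sent_dim=768, video_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(sent_dim, video_dim)
        self.to_shift = nn.Linear(sent_dim, video_dim)

    def forward(self, video_feat, sent_emb):
        # video_feat: (batch, time, video_dim); sent_emb: (batch, sent_dim)
        scale = self.to_scale(sent_emb).unsqueeze(1)  # (batch, 1, video_dim)
        shift = self.to_shift(sent_emb).unsqueeze(1)
        return video_feat * (1 + scale) + shift

module = LanguageGuidedModulation()
video = torch.randn(2, 16, 512)   # 16 temporal segments per clip
sentence = torch.randn(2, 768)    # query embedding
guided = module(video, sentence)  # same shape, now query-conditioned
```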
This list is automatically generated from the titles and abstracts of the papers on this site.