The Progression of Transformers from Language to Vision to MOT: A Literature Review on Multi-Object Tracking with Transformers
- URL: http://arxiv.org/abs/2406.16784v1
- Date: Mon, 24 Jun 2024 16:45:28 GMT
- Title: The Progression of Transformers from Language to Vision to MOT: A Literature Review on Multi-Object Tracking with Transformers
- Authors: Abhi Kamboj
- Abstract summary: The transformer neural network architecture allows for autoregressive sequence-to-sequence modeling.
Transformers have also been applied across a wide variety of pattern recognition tasks, particularly in computer vision.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The transformer neural network architecture allows for autoregressive sequence-to-sequence modeling through the use of attention layers. It was originally created with the application of machine translation but has revolutionized natural language processing. Recently, transformers have also been applied across a wide variety of pattern recognition tasks, particularly in computer vision. In this literature review, we describe major advances in computer vision utilizing transformers. We then focus specifically on Multi-Object Tracking (MOT) and discuss how transformers are increasingly becoming competitive in state-of-the-art MOT works, yet still lag behind traditional deep learning methods.
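The abstract's central mechanism, attention layers enabling autoregressive sequence-to-sequence modeling, can be illustrated with a minimal sketch of single-head scaled dot-product self-attention with a causal mask. This is a generic illustration of the standard mechanism, not code from the reviewed paper; the function and weight names are hypothetical.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask,
    the ingredient that makes autoregressive sequence modeling possible.
    X: (T, d) sequence of token embeddings; Wq/Wk/Wv: (d, d) projections.
    Names are illustrative, not taken from the paper."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (T, T) pairwise similarities
    # Causal mask: position t may only attend to positions <= t
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # (T, d) attended output

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = causal_self_attention(X, *W)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first position attends only to itself, so its output is exactly its own value vector; later positions mix information from all earlier tokens, which is what enables left-to-right generation.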
Related papers
- Vision Language Transformers: A Survey [0.9137554315375919]
Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform.
Recent research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling.
Transformer models have greatly improved performance and versatility over previous vision language models.
arXiv Detail & Related papers (2023-07-06T19:08:56Z)
- Machine Learning for Brain Disorders: Transformers and Visual Transformers [4.186575888568896]
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision.
We introduce the attention mechanism (Section 1), and then the basic Transformer block, including the Vision Transformer.
Finally, we introduce Visual Transformers applied to tasks other than image classification, such as detection, segmentation, generation and training without labels.
arXiv Detail & Related papers (2023-03-21T17:57:33Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has attracted attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
- TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [87.75122600164167]
We argue that the standard representation -- bounding boxes -- is not adapted to learning transformers for multiple-object tracking.
We propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets.
arXiv Detail & Related papers (2021-03-28T14:49:36Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
The transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.