Can you text what is happening? Integrating pre-trained language
encoders into trajectory prediction models for autonomous driving
- URL: http://arxiv.org/abs/2309.05282v2
- Date: Wed, 13 Sep 2023 10:38:28 GMT
- Title: Can you text what is happening? Integrating pre-trained language
encoders into trajectory prediction models for autonomous driving
- Authors: Ali Keysan, Andreas Look, Eitan Kosman, Gonca Gürsun, Jörg Wagner,
Yu Yao, Barbara Rakitsch
- Abstract summary: In autonomous driving tasks, scene understanding is the first step towards predicting the future behavior of the surrounding traffic participants.
We propose a novel text-based representation of traffic scenes and process it with a language encoder.
- Score: 14.45922081268057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In autonomous driving tasks, scene understanding is the first step towards
predicting the future behavior of the surrounding traffic participants. Yet,
how to represent a given scene and extract its features are still open research
questions. In this study, we propose a novel text-based representation of
traffic scenes and process it with a pre-trained language encoder.
First, we show that text-based representations, combined with classical
rasterized image representations, lead to descriptive scene embeddings. Second,
we benchmark our predictions on the nuScenes dataset and show significant
improvements compared to baselines. Third, we show in an ablation study that a
joint encoder of text and rasterized images outperforms the individual encoders,
confirming that both representations have complementary strengths.
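Below is a minimal sketch of the pipeline the abstract describes: a traffic scene is rendered as a short textual description, encoded with a pre-trained language model, and fused with the embedding of a rasterized scene image. The DistilBERT encoder, the toy CNN, the scene-text format, and the concatenation fusion head are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the paper's code): joint text + raster scene encoder.
# Assumptions: DistilBERT as the pre-trained language encoder, a small CNN
# over a 3x224x224 rasterized top-down scene, and concatenation fusion.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointSceneEncoder(nn.Module):
    def __init__(self, text_model="distilbert-base-uncased", embed_dim=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        # Toy CNN over the rasterized image representation of the scene.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Joint scene embedding that a trajectory decoder could consume.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, scene_texts, raster):
        toks = self.tokenizer(scene_texts, padding=True, truncation=True,
                              return_tensors="pt")
        hidden = self.text_encoder(**toks).last_hidden_state[:, 0]  # first-token summary
        txt = self.text_proj(hidden)
        img = self.image_encoder(raster)
        return self.fuse(torch.cat([txt, img], dim=-1))

# Usage: one hypothetical scene description plus a random raster stand-in.
enc = JointSceneEncoder()
desc = ["Ego vehicle at 12 m/s; pedestrian crossing 8 m ahead; car parked right."]
scene_embedding = enc(desc, torch.rand(1, 3, 224, 224))  # shape (1, 256)
```

In the paper's framing, this joint embedding would condition a downstream trajectory prediction head; the decoder itself is omitted here.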
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z)
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose MaskOCR, a novel approach that unifies vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval [43.2299969152561]
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets with both zero-shot and fine-tune evaluation protocols.
arXiv Detail & Related papers (2022-04-26T16:06:31Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8%, respectively, when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content and is not responsible for any consequences.