Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving
- URL: http://arxiv.org/abs/2405.01691v1
- Date: Thu, 2 May 2024 19:27:28 GMT
- Title: Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving
- Authors: Zhenjiang Mao, Dong-You Jhong, Ao Wang, Ivan Ruchkin
- Abstract summary: Multimodal inputs offer the possibility of taking human language as a latent representation.
In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation.
Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder.
- Score: 1.3499500088995464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Out-of-distribution (OOD) detection is essential in autonomous driving, to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders that can only produce latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve the detection performance when combined with standard representations.
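The core idea in the abstract, using cosine similarities between an image embedding and a set of text-prompt embeddings as the latent representation, can be illustrated with a minimal sketch. The embeddings below are toy placeholder vectors, not real CLIP outputs; in practice they would come from CLIP's image and text encoders. The function names `cosine_similarity`, `language_latent`, and `distance_score` are hypothetical, and the distance-based score is an assumed stand-in for a downstream detector, not the authors' method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def language_latent(image_embedding, prompt_embeddings):
    """One similarity per text prompt: the language-based latent vector.

    Each dimension is tied to a human-written prompt, so it can be
    read directly, unlike a raw vision-encoder feature.
    """
    return [cosine_similarity(image_embedding, p) for p in prompt_embeddings]


def distance_score(latent, reference_latent):
    """Assumed scoring rule: Euclidean distance to a reference latent,
    e.g. the mean latent of in-distribution images. Larger values
    suggest a more unusual input."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(latent, reference_latent)))


# Toy 2-D embeddings standing in for CLIP outputs (assumption).
image = [1.0, 0.0]
prompts = [
    [1.0, 0.0],  # stand-in for a prompt such as "a clear road"
    [0.0, 1.0],  # stand-in for a prompt such as "heavy fog"
    [1.0, 1.0],  # stand-in for a prompt such as "light rain"
]

latent = language_latent(image, prompts)
score = distance_score(latent, [1.0, 0.0, 0.5])
```

Because each latent dimension corresponds to one prompt, a user could change what counts as out-of-distribution by editing the prompt list rather than retraining an encoder, which is the controllability the abstract points to.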
Related papers
- LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z)
- Driver Activity Classification Using Generalizable Representations from Vision-Language Models [0.0]
We present a novel approach leveraging generalizable representations from vision-language models for driver activity classification.
Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems.
arXiv Detail & Related papers (2024-04-23T10:42:24Z)
- Detecting out-of-distribution text using topological features of transformer-based language models [0.5735035463793009]
We explore the use of topological features of self-attention maps from transformer-based language models to detect when input text is out of distribution.
We evaluate our approach on BERT and compare it to a traditional OOD approach using CLS embeddings.
Our results show that our approach outperforms CLS embeddings in distinguishing in-distribution samples from far-out-of-domain samples, but struggles with near or same-domain datasets.
arXiv Detail & Related papers (2023-11-22T02:04:35Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition [7.086647707011785]
Human activity recognition (HAR) in wearable computing is typically based on direct processing of sensor data.
Recent advancements in applying Vector Quantization (VQ) to wearables applications enable us to directly learn a mapping between short spans of sensor data and a codebook of vectors.
This work presents a proof-of-concept for demonstrating how effective discrete representations can be derived.
arXiv Detail & Related papers (2023-06-01T19:49:43Z)
- Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution.
We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.