VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic
Self-Supervision
- URL: http://arxiv.org/abs/2304.03135v1
- Date: Thu, 6 Apr 2023 15:16:29 GMT
- Title: VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic
Self-Supervision
- Authors: Mengyin Liu, Jie Jiang, Chao Zhu, Xu-Cheng Yin
- Abstract summary: We propose a novel approach via Vision-Language semantic self-supervision for context-aware Pedestrian Detection.
First, we propose a self-supervised Vision-Language Semantic (VLS) segmentation method, which learns both fully-supervised pedestrian detection and contextual segmentation.
Second, a self-supervised Prototypical Semantic Contrastive (PSC) learning method is proposed to better discriminate pedestrians and other classes.
- Score: 13.268399018823903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting pedestrians accurately in urban scenes is significant for realistic
applications like autonomous driving or video surveillance. However, confusing
human-like objects often lead to wrong detections, and small-scale or heavily
occluded pedestrians are easily missed due to their unusual appearances. To
address these challenges, object regions alone are inadequate; how to fully
utilize more explicit and semantic contexts thus becomes a key problem. Meanwhile,
previous context-aware pedestrian detectors either only learn latent contexts
with visual clues, or need laborious annotations to obtain explicit and
semantic contexts. Therefore, we propose in this paper a novel approach via
Vision-Language semantic self-supervision for context-aware Pedestrian
Detection (VLPD) to model explicitly semantic contexts without any extra
annotations. Firstly, we propose a self-supervised Vision-Language Semantic
(VLS) segmentation method, which learns both fully-supervised pedestrian
detection and contextual segmentation via self-generated explicit labels of
semantic classes by vision-language models. Furthermore, a self-supervised
Prototypical Semantic Contrastive (PSC) learning method is proposed to better
discriminate pedestrians and other classes, based on more explicit and semantic
contexts obtained from VLS. Extensive experiments on popular benchmarks show
that our proposed VLPD achieves superior performance over previous
state-of-the-art methods, particularly under challenging circumstances such as small
scale and heavy occlusion. Code is available at
https://github.com/lmy98129/VLPD.
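The two self-supervised components named above, VLS pseudo-label segmentation and PSC prototypical contrastive learning, can be pictured with a short sketch. The following PyTorch snippet is an illustrative assumption of how context-class pseudo-labels might be derived from CLIP-style text embeddings and how a pixel-to-prototype contrastive loss could be computed; it is not the authors' released implementation (that code lives at the GitHub link above), and the prompt list, feature shapes, and temperature are made up for the example.

```python
# Hypothetical sketch of the two self-supervised ideas named in the abstract:
# (1) VLS-style pseudo-labels: score dense image features against text
#     embeddings of context classes from a CLIP-like vision-language model;
# (2) PSC-style loss: pull each pixel feature toward the prototype of its
#     pseudo-class and push it away from the other class prototypes.
# All names below (CONTEXT_PROMPTS, feature shapes, temperature) are
# illustrative assumptions, not the authors' interface.

import torch
import torch.nn.functional as F

CONTEXT_PROMPTS = ["a photo of a person", "a photo of a road",
                   "a photo of a building", "a photo of vegetation"]  # assumed classes


def vls_pseudo_labels(dense_feats: torch.Tensor, text_protos: torch.Tensor) -> torch.Tensor:
    """dense_feats: (B, C, H, W) image features; text_protos: (K, C) text embeddings.
    Returns a (B, H, W) pseudo-label map by assigning each pixel to its most
    similar text prototype (cosine similarity)."""
    f = F.normalize(dense_feats, dim=1)
    p = F.normalize(text_protos, dim=1)
    logits = torch.einsum("bchw,kc->bkhw", f, p)   # per-pixel similarity to each class
    return logits.argmax(dim=1)                    # hard pseudo-label map


def psc_loss(pixel_feats: torch.Tensor, pseudo_labels: torch.Tensor,
             prototypes: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Prototypical contrastive objective over pixels: cross-entropy between
    pixel-to-prototype similarities and the pseudo-label of each pixel."""
    b, c, h, w = pixel_feats.shape
    f = F.normalize(pixel_feats, dim=1).permute(0, 2, 3, 1).reshape(-1, c)  # (B*H*W, C)
    p = F.normalize(prototypes, dim=1)                                       # (K, C)
    logits = f @ p.t() / temperature                                         # (B*H*W, K)
    return F.cross_entropy(logits, pseudo_labels.reshape(-1))
```

In VLPD these objectives are described as being trained jointly with the fully supervised detection loss, so a real training step would add the detector's loss terms; for the actual label generation, prototype construction, and loss weighting, refer to the official repository.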
Related papers
- Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors.
Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning.
We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z)
- Point What You Mean: Visually Grounded Instruction Policy [42.52502990975079]
Point-VLA is a plug-and-play policy that augments language instructions with explicit visual cues to resolve referential ambiguity.
We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs.
arXiv Detail & Related papers (2025-12-22T00:44:19Z)
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify.
Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
arXiv Detail & Related papers (2025-12-12T01:59:23Z)
- Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability [31.30541946703775]
Translating internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.
Recent dictionary learning methods such as Sparse Autoencoders provide a promising route to discover human-interpretable features.
But they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences".
arXiv Detail & Related papers (2025-10-30T17:59:30Z)
- Context Matters: Learning Global Semantics via Object-Centric Representation [8.195437248815802]
Vision models have yet to exhibit comparable progress in in-context learning.
We argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes.
We propose to directly model "object" as the visual equivalent of "word," pushing the model to learn the global context and semantics among visual elements.
arXiv Detail & Related papers (2025-10-07T08:33:36Z)
- VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments [0.0]
We propose a vision-language framework that models the changing landscape of drivers' gaze through natural language.
Our approach integrates both low-level cues and top-down context, enabling language-based descriptions of gaze behavior.
Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability.
arXiv Detail & Related papers (2025-08-07T21:01:43Z)
- Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms.
Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible.
We propose our training-free method, Enriched-FineR, which demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z)
- Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments [6.295098866364597]
We propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs).
Our approach follows a 'Segment Detect Select' framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.
arXiv Detail & Related papers (2025-03-29T14:46:45Z)
- OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions.
Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies.
We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [2.0122032639916485]
We analyze effective knowledge distillation of semantic labels to smaller Vision networks.
This can be used for the semantic representation of complex scenes for downstream decision-making for planning and control.
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization [3.996503381756227]
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations.
We propose a novel framework that aligns human action knowledge and semantic knowledge in a probabilistic embedding space.
Our method significantly outperforms all previous state-of-the-art methods.
arXiv Detail & Related papers (2024-08-12T07:09:12Z)
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
- BID: Boundary-Interior Decoding for Unsupervised Temporal Action Localization Pre-Training [13.273908640951252]
We propose the first unsupervised pre-training framework that partitions a skeleton-based motion sequence into semantically meaningful pre-action segments.
By fine-tuning our pre-training network with a small amount of annotated data, we show results outperforming SOTA methods by a large margin.
arXiv Detail & Related papers (2024-03-12T06:23:45Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.
Recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper.
We propose a Linguistic Perception Vision model (LPV) which explores the linguistic capability of vision models for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves a strong performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
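The entry above mentions convolutional graph encoders that incorporate semantic parses into finetuning. As a rough illustration of that idea (not the cited paper's implementation; shapes and the adjacency construction are assumptions for the example), a single graph-convolution layer over dependency edges could look like:

```python
# Rough illustration: one graph-convolution layer that propagates token
# features along semantic dependency edges. Hypothetical shapes and a 0/1
# adjacency matrix are assumed; the cited paper's encoder may differ.

import torch
import torch.nn as nn

class DependencyGraphConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """token_feats: (B, T, D) contextual token embeddings;
        adj: (B, T, T) 0/1 adjacency built from a semantic dependency parse.
        Each token averages its parse neighbours, and the projected result is
        added residually to the original features."""
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)   # avoid division by zero
        neighbours = (adj @ token_feats) / deg             # mean over parse neighbours
        return token_feats + torch.relu(self.proj(neighbours))
```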
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.