CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
- URL: http://arxiv.org/abs/2405.03194v1
- Date: Mon, 6 May 2024 06:38:49 GMT
- Title: CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
- Authors: Zhizhao Duan, Hao Cheng, Duo Xu, Xi Wu, Xiangxie Zhang, Xi Ye, Zhen Xie
- Abstract summary: Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention.
This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios.
- Score: 19.730287885060633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy through (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering during both training and testing phases; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) advancing prediction accuracy via a unique sequential questioning-based prediction augmentation. Demonstrating top-tier performance, our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code can be found at https://github.com/alibaba/AICITY2024_Track2_AliOpenTrek_CityLLaVA
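The abstract names block expansion in step (3) but gives no code; below is a minimal, hypothetical PyTorch sketch of the idea in the LLaMA-Pro style that block expansion is usually associated with: interleave frozen original blocks with trainable copies whose output projections are zero-initialized, so the expanded model starts out computing exactly what the base model did. The module attribute names (`self_attn.o_proj`, `mlp.down_proj`) are assumptions about the underlying VLM, not taken from the paper.

```python
import copy
import torch.nn as nn

def expand_blocks(blocks: nn.ModuleList, every: int = 4) -> nn.ModuleList:
    """Interleave zero-initialized copies of existing decoder blocks.

    Sketch of LLaMA-Pro-style block expansion: each copy's output
    projections are zeroed, so at initialization the copy adds nothing to
    the residual stream and the expanded model reproduces the base model.
    Only the copies are trained; the originals stay frozen.
    """
    expanded = []
    for i, block in enumerate(blocks):
        block.requires_grad_(False)              # freeze the original block
        expanded.append(block)
        if (i + 1) % every == 0:
            new_block = copy.deepcopy(block)
            # Attribute names below assume LLaMA-style blocks (an assumption
            # about the underlying VLM, not confirmed by the paper).
            nn.init.zeros_(new_block.self_attn.o_proj.weight)
            nn.init.zeros_(new_block.mlp.down_proj.weight)
            new_block.requires_grad_(True)       # train only the new block
            expanded.append(new_block)
    return nn.ModuleList(expanded)
```

Because a zeroed output projection makes each inserted block an identity mapping at initialization, fine-tuning starts from the base model's behavior and only the new blocks accumulate task-specific updates.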
Related papers
- Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models [31.566051946153802]
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios. We introduce Impromptu VLA: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories.
arXiv Detail & Related papers (2025-05-29T17:59:46Z)
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization [80.86205966195593]
Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions.
We introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness.
Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
arXiv Detail & Related papers (2025-03-26T12:28:20Z)
- OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x the FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
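The OPUS summary above only names the learnable-query formulation; a minimal sketch of what such a set-prediction head can look like follows, in the DETR style, with dimensions and layer counts that are illustrative assumptions rather than OPUS's actual architecture.

```python
import torch.nn as nn

class SparseOccupancyHead(nn.Module):
    """Hypothetical DETR-style head: a fixed set of learnable queries
    cross-attends to scene features, and each query directly predicts one
    occupied 3D point plus its semantic class -- no dense voxel grid is
    materialized. Sizes are illustrative, not OPUS's configuration."""

    def __init__(self, num_queries=600, d_model=256, num_classes=17):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.point_head = nn.Linear(d_model, 3)         # (x, y, z) location
        self.class_head = nn.Linear(d_model, num_classes)

    def forward(self, scene_feats):                     # (B, N, d_model)
        B = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, scene_feats)                # queries attend to scene
        return self.point_head(h), self.class_head(h)   # points, class logits
```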
- VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [10.748597086208145]
In this work, we propose a novel trajectory prediction method that also incorporates visual input from surround-view cameras.
Our method achieves a latency of 53 ms, making it feasible for real-time processing.
Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance.
arXiv Detail & Related papers (2024-07-17T06:39:52Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
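The VeCAF summary does not spell out the selection criterion; as a loose, generic sketch of objective-aware data selection (score candidates by their current loss and keep the most informative), under the explicit assumption that this is a stand-in, not VeCAF's actual parametric method, which also exploits natural-language annotations:

```python
import torch

@torch.no_grad()
def select_finetuning_data(model, loss_fn, pool_loader, budget=4096):
    """Hypothetical objective-aware selection: score each candidate by its
    current loss and keep the highest-scoring examples, so finetuning spends
    its budget on points that most move the training objective.

    loss_fn must return per-example losses (e.g. reduction='none').
    """
    model.eval()
    xs, ys, scores = [], [], []
    for x, y in pool_loader:
        scores.append(loss_fn(model(x), y))   # per-example importance score
        xs.append(x)
        ys.append(y)
    scores = torch.cat(scores)
    xs, ys = torch.cat(xs), torch.cat(ys)
    top = scores.topk(min(budget, scores.numel())).indices
    return xs[top], ys[top]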
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving [44.070636456960045]
VAD is an end-to-end vectorized paradigm for autonomous driving.
VAD exploits the vectorized agent motion and map elements as explicit instance-level planning constraints.
VAD runs much faster than previous end-to-end planning methods.
arXiv Detail & Related papers (2023-03-21T17:59:22Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Adaptive Siamese Tracking with a Compact Latent Network [219.38172719948048]
We present an intuitive view that simplifies Siamese-based trackers by converting the tracking task into a classification problem.
Under this view, we analyze these trackers in depth through visual simulations and real tracking examples.
We apply the proposed compact latent network to adjust three classical Siamese-based trackers: SiamRPN++, SiamFC, and SiamBAN.
arXiv Detail & Related papers (2023-02-02T08:06:02Z)
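To make the "tracking as classification" view above concrete, here is a minimal sketch of the classic SiamFC-style cross-correlation (not the paper's compact latent network): the exemplar feature map is slid over the search feature map as a convolution kernel, so every spatial position of the response map is a target-vs-background score.

```python
import torch.nn.functional as F

def siamese_response(exemplar_feat, search_feat):
    """SiamFC-style cross-correlation: tracking reduced to a per-location
    classification over the search region.

    exemplar_feat: (B, C, h, w) -- features of the target template
    search_feat:   (B, C, H, W) -- features of the search region
    returns:       (B, 1, H-h+1, W-w+1) per-location target scores
    """
    B, C, H, W = search_feat.shape
    # Grouped convolution correlates each batch element with its own kernel.
    resp = F.conv2d(search_feat.reshape(1, B * C, H, W), exemplar_feat, groups=B)
    return resp.reshape(B, 1, *resp.shape[-2:])
```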
- Declaration-based Prompt Tuning for Visual Question Answering [16.688288454811016]
We propose an innovative visual-language (VL) fine-tuning paradigm named Declaration-based Prompt Tuning (DPT).
DPT jointly optimizes the pre-training and fine-tuning objectives of the VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task.
Experimental results on GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin regarding accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings.
arXiv Detail & Related papers (2022-05-05T05:56:55Z)
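DPT's core move is recasting interrogative VQA questions as declarative sentences with a mask slot, so the fine-tuning input format matches the masked-language-modeling format of pre-training. A toy, rule-based illustration follows; the regex patterns are assumptions, and the paper's actual conversion is more elaborate.

```python
import re

# Toy question-to-declaration rules -- illustrative assumptions only.
RULES = [
    (re.compile(r"^what color is (the .+)\?$", re.I), r"the color of \1 is [MASK]."),
    (re.compile(r"^what is (the .+)\?$", re.I),       r"\1 is [MASK]."),
    (re.compile(r"^how many (.+) are there\?$", re.I), r"there are [MASK] \1."),
]

def to_declaration(question: str) -> str:
    """Rewrite a VQA question as a declarative sentence with a [MASK] slot,
    aligning fine-tuning inputs with masked-language pre-training."""
    q = question.strip()
    for pattern, template in RULES:
        if pattern.match(q):
            return pattern.sub(template, q)
    return q + " The answer is [MASK]."   # fallback template

print(to_declaration("What color is the car?"))
# -> "the color of the car is [MASK]."
```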
- Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road-attribute prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
- Action Sequence Predictions of Vehicles in Urban Environments using Map and Social Context [152.0714518512966]
This work studies the problem of predicting the sequence of future actions for surround vehicles in real-world driving scenarios.
The first contribution is an automatic method to convert the trajectories recorded in real-world driving scenarios to action sequences with the help of HD maps.
The second contribution lies in applying the method to the well-known traffic agent tracking and prediction dataset Argoverse, resulting in 228,000 action sequences.
The third contribution is a novel action sequence prediction method that integrates the traffic agents' past positions and velocities, map information, and social context into a single end-to-end trainable neural network.
arXiv Detail & Related papers (2020-04-29T14:59:58Z)
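The trajectory-to-action conversion in the first contribution can be pictured as labeling each timestep from kinematics; the paper does this with HD-map lane geometry, whereas the toy sketch below is map-free, with thresholds and action vocabulary chosen as assumptions.

```python
import numpy as np

def trajectory_to_actions(xy, dt=0.1, acc_thresh=0.5, yaw_thresh=0.05):
    """Label each step of a 2D trajectory (shape (T, 2), meters) with a
    coarse driving action. Map-free toy stand-in for the paper's HD-map-
    based conversion; thresholds are illustrative assumptions."""
    v = np.diff(xy, axis=0) / dt                  # (T-1, 2) velocity vectors
    speed = np.linalg.norm(v, axis=1)
    accel = np.diff(speed) / dt                   # longitudinal acceleration
    heading = np.arctan2(v[:, 1], v[:, 0])
    dpsi = (np.diff(heading) + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    yaw_rate = dpsi / dt

    actions = []
    for a, w in zip(accel, yaw_rate):
        if w > yaw_thresh:
            actions.append("turn_left")
        elif w < -yaw_thresh:
            actions.append("turn_right")
        elif a > acc_thresh:
            actions.append("accelerate")
        elif a < -acc_thresh:
            actions.append("decelerate")
        else:
            actions.append("keep")
    return actions
```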