Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge
- URL: http://arxiv.org/abs/2411.02999v1
- Date: Tue, 05 Nov 2024 11:00:55 GMT
- Title: Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge
- Authors: Bin Huang, Siyu Wang, Yuanpeng Chen, Yidan Wu, Hui Song, Zifan Ding, Jing Leng, Chengpeng Liang, Peng Xue, Junliang Zhang, Tiankun Zhao
- Abstract summary: This report outlines the methodologies we applied for the PRCV Challenge.
It focuses on cognition and decision-making in driving scenarios.
Our model achieved a score of 0.6064, securing the first prize in the competition's final results.
- Abstract: This technical report outlines the methodologies we applied for the PRCV Challenge, focusing on cognition and decision-making in driving scenarios. We employed InternVL-2.0, a pioneering open-source multi-modal model, and enhanced it by refining both the model input and the training methodology. For the input data, we strategically concatenated and formatted the multi-view images; notably, we used the coordinates of the original images without transformation. For model training, we first pre-trained the model on publicly available autonomous driving datasets to bolster its alignment with the challenge tasks, then fine-tuned it on the DriveLM-nuScenes dataset. During fine-tuning, we modified the loss function to sharpen the model's precision in predicting coordinate values. These approaches ensure that our model possesses advanced cognitive and decision-making capabilities in driving scenarios. Consequently, our model achieved a score of 0.6064, securing the first prize in the competition's final results.
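The two input/training ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the 2x3 camera layout, the `coord_weight` knob, and the token-level weighting scheme are all assumptions; the report only states that multi-view images were concatenated with original-image coordinates kept untransformed, and that the loss was modified to improve coordinate prediction.

```python
import numpy as np

def concat_multiview(views, rows=2, cols=3):
    """Tile same-sized HxWx3 camera views into a rows x cols canvas.

    Pixel coordinates within each view are left untransformed, mirroring the
    report's note that original-image coordinates are used directly.
    """
    h, w, c = views[0].shape
    canvas = np.zeros((rows * h, cols * w, c), dtype=views[0].dtype)
    for i, v in enumerate(views):
        r, col = divmod(i, cols)
        canvas[r * h:(r + 1) * h, col * w:(col + 1) * w] = v
    return canvas

def weighted_token_loss(logits, targets, coord_mask, coord_weight=2.0):
    """Token-level cross-entropy with coordinate tokens upweighted.

    `coord_weight` is a hypothetical knob; the report does not disclose
    how exactly the loss was modified.
    """
    # log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    weights = np.where(coord_mask, coord_weight, 1.0)
    return float((weights * nll).sum() / weights.sum())
```

With uniform logits the weighted loss reduces to the plain cross-entropy `log(V)` regardless of the weights, so the upweighting only changes gradients where coordinate tokens are predicted imperfectly.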
Related papers
- AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction [56.72301849123049]
We present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ dataset challenge at CVPR 2024.
Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling.
Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth.
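The flow-warping step described above can be sketched as a forward warp of voxel features along per-voxel displacements. This is a generic nearest-neighbor illustration under assumed (X, Y, Z, C) feature and (X, Y, Z, 3) flow layouts, not AdaOcc's actual implementation.

```python
import numpy as np

def warp_voxels_with_flow(feat, flow):
    """Forward-warp voxel features to the next frame along predicted flow.

    feat: (X, Y, Z, C) voxel features; flow: (X, Y, Z, 3) displacement in
    voxel units. Nearest-neighbor scatter; colliding voxels simply
    overwrite each other in this sketch.
    """
    X, Y, Z, _ = feat.shape
    out = np.zeros_like(feat)
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    )
    # round displaced positions to the nearest voxel and clamp to the grid
    tgt = np.clip(np.rint(idx + flow).astype(int), 0, np.array([X - 1, Y - 1, Z - 1]))
    out[tgt[..., 0], tgt[..., 1], tgt[..., 2]] = feat
    return out
```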
arXiv Detail & Related papers (2024-07-01T16:32:15Z)
- SalFoM: Dynamic Saliency Prediction with Video Foundation Models [37.25208752620703]
Video saliency prediction (VSP) models have shown promising performance in mimicking the human visual system.
We introduce SalFoM, a novel encoder-decoder video transformer architecture.
Our model employs an UnMasked Teacher (UMT) feature extractor and presents a heterogeneous decoder with a spatio-temporal transformer.
arXiv Detail & Related papers (2024-04-03T22:38:54Z)
- Data Quality Aware Approaches for Addressing Model Drift of Semantic Segmentation Models [1.6385815610837167]
This study investigates two prominent quality-aware strategies to combat model drift.
The first leverages image quality assessment metrics to meticulously select high-quality training data, improving model robustness.
The second uses learned feature vectors from existing models to guide the selection of future data, aligning it with the model's prior knowledge.
arXiv Detail & Related papers (2024-02-11T18:01:52Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
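The summary above cites a focal-loss variant tailored to the long-tailed MECCANO dataset. As a hedged illustration, here is the generic multi-class focal loss it builds on; the authors' tailored variant is not specified, and the `gamma`/`alpha` values below are placeholders.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights easy, confident examples.

    The (1 - p_t)^gamma factor shrinks the loss on well-classified samples,
    so training focuses on rare, hard classes; `alpha` optionally adds
    per-class weights for the long tail.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    pt = probs[np.arange(len(targets)), targets]  # prob of the true class
    w = (1.0 - pt) ** gamma
    if alpha is not None:
        w = w * alpha[targets]
    return float((-w * np.log(pt)).mean())
```

Setting `gamma=0` and `alpha=None` recovers the ordinary cross-entropy, which is a useful sanity check when tuning the variant.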
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- 1st Place in ICCV 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision: Budgeted Model Training Challenge [15.213786895534225]
We describe a resource-aware backbone search framework composed of profile and instantiation phases.
We employ multi-resolution ensembles to boost inference accuracy on limited resources.
Based on our approach, we won first place in the International Conference on Computer Vision (ICCV) 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision (RCV).
arXiv Detail & Related papers (2023-08-09T05:38:18Z) - Towards Efficient Task-Driven Model Reprogramming with Foundation Models [52.411508216448716]
Vision foundation models exhibit impressive power, benefiting from the extremely large model capacity and broad training data.
However, in practice, downstream scenarios may only support a small model due to the limited computational resources or efficiency considerations.
This brings a critical challenge for the real-world application of foundation models: one has to transfer the knowledge of a foundation model to the downstream task.
arXiv Detail & Related papers (2023-04-05T07:28:33Z) - Confidence Attention and Generalization Enhanced Distillation for
Continuous Video Domain Adaptation [62.458968086881555]
Continuous Video Domain Adaptation (CVDA) is a scenario where a source model is required to adapt to a series of individually available changing target domains.
We propose a Confidence-Attentive network with geneRalization enhanced self-knowledge disTillation (CART) to address the challenge in CVDA.
arXiv Detail & Related papers (2023-03-18T16:40:10Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Source-Free Open Compound Domain Adaptation in Semantic Segmentation [99.82890571842603]
In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model.
We propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles in the feature-level.
Our method produces state-of-the-art results on the C-Driving dataset.
arXiv Detail & Related papers (2021-06-07T08:38:41Z)
- Incorporating Orientations into End-to-end Driving Model for Steering Control [12.163394005517766]
We present a novel end-to-end deep neural network model for autonomous driving.
It takes a monocular image sequence as input and directly generates the steering control angle.
Our dataset includes multiple driving scenarios, such as urban, country, and off-road.
arXiv Detail & Related papers (2021-03-10T03:14:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.