Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
- URL: http://arxiv.org/abs/2412.07247v1
- Date: Tue, 10 Dec 2024 07:13:39 GMT
- Title: Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
- Authors: Jiahan Li, Zhiqi Li, Tong Lu
- Abstract summary: This report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge.
We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes.
Our single model achieved a score of 0.6002 on the final leaderboard.
- Score: 23.193095382776725
- License:
- Abstract: This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leaderboard.
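The abstract describes two preprocessing steps in prose only: tiling the multi-view nuScenes camera images into a single input image for InternVL-1.5, and expanding annotated object center points in DriveLM-nuScenes into bounding boxes. The Python sketch below illustrates both ideas under stated assumptions; the 2x3 tiling layout, tile resolution, fixed box size, and all function names are illustrative guesses, not the authors' actual preprocessing code.

```python
from PIL import Image

# Assumed ordering of nuScenes' six surround-view cameras (illustrative).
CAMERAS = ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT",
           "CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"]

def concat_multiview(images, cols=3, tile_size=(448, 252)):
    """Tile the camera images onto one canvas (here 2 rows x 3 columns)."""
    rows = (len(images) + cols - 1) // cols
    w, h = tile_size
    canvas = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(images):
        r, c = divmod(i, cols)
        canvas.paste(img.resize(tile_size), (c * w, r * h))
    return canvas

def center_to_bbox(cx, cy, box_size=50, img_w=1600, img_h=900):
    """Expand an annotated center point into a square bounding box,
    clipped to the image. The fixed box size is an assumption."""
    half = box_size / 2
    x1 = max(0.0, cx - half)
    y1 = max(0.0, cy - half)
    x2 = min(float(img_w), cx + half)
    y2 = min(float(img_h), cy + half)
    return (x1, y1, x2, y2)
```

A fixed box extent is the simplest way to turn a point annotation into a box; the abstract does not specify how the authors sized these boxes, so in practice the extent would need to be tuned or derived from additional metadata.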
Related papers
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [111.1291107651131]
Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
arXiv Detail & Related papers (2025-02-07T18:59:56Z) - Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge [8.941623670652389]
This report outlines the methodologies we applied for the PRCV Challenge.
It focuses on cognition and decision-making in driving scenarios.
Our model achieved a score of 0.6064, securing the first prize on the competition's final results.
arXiv Detail & Related papers (2024-11-05T11:00:55Z) - How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM).
It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z) - Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [0.0]
We develop an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving.
In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations.
arXiv Detail & Related papers (2024-03-28T21:18:33Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end, open-set (any environment/scene) autonomous driving that provides driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z) - 1st Place in ICCV 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision: Budgeted Model Training Challenge [15.213786895534225]
We describe a resource-aware backbone search framework composed of profile and instantiation phases.
We employ multi-resolution ensembles to boost inference accuracy on limited resources.
Based on our approach, we won first place in the International Conference on Computer Vision (ICCV) 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision (RCV).
arXiv Detail & Related papers (2023-08-09T05:38:18Z) - Q-YOLOP: Quantization-aware You Only Look Once for Panoptic Driving Perception [6.3709120604927945]
We present an efficient and quantization-aware panoptic driving perception model (Q-YOLOP) for object detection, drivable area segmentation, and lane line segmentation.
The proposed model achieves state-of-the-art performance with an mAP@0.5 of 0.622 for object detection and an mIoU of 0.612 for segmentation.
arXiv Detail & Related papers (2023-07-10T13:02:46Z) - Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z) - CERBERUS: Simple and Effective All-In-One Automotive Perception Model with Multi Task Learning [4.622165486890318]
In-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task.
We present CERBERUS, a lightweight model that leverages a multitask-learning approach to enable the execution of multiple perception tasks at the cost of a single inference.
arXiv Detail & Related papers (2022-10-03T08:17:26Z)