Related papers: Balancing Performance and Efficiency in Zero-shot Robotic Navigation

Balancing Performance and Efficiency in Zero-shot Robotic Navigation

URL: http://arxiv.org/abs/2406.03015v1
Date: Wed, 5 Jun 2024 07:31:05 GMT
Title: Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Authors: Dmytro Kuzmenko, Nadiya Shvai,
Abstract summary: We present an optimization study of the Vision-Language Frontier Maps applied to the Object Goal Navigation task in robotics. Our work evaluates the efficiency and performance of various vision-language models, object detectors, segmentation models, and Visual Question Answering modules.
Score: 1.6574413179773757
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an optimization study of the Vision-Language Frontier Maps (VLFM) applied to the Object Goal Navigation task in robotics. Our work evaluates the efficiency and performance of various vision-language models, object detectors, segmentation models, and multi-modal comprehension and Visual Question Answering modules. Using the $\textit{val-mini}$ and $\textit{val}$ splits of Habitat-Matterport 3D dataset, we conduct experiments on a desktop with limited VRAM. We propose a solution that achieves a higher success rate (+1.55%) improving over the VLFM BLIP-2 baseline without substantial success-weighted path length loss while requiring $\textbf{2.3 times}$ less video memory. Our findings provide insights into balancing model performance and computational efficiency, suggesting effective deployment strategies for resource-limited environments.

Related papers

EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Our approach only leverages a small number of samples to search for the desired pruning policy. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z)
FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving. Recent methods leverage the range-view representation to improve processing efficiency. We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z)
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone. The number of vision tokens increases quadratically as the image resolutions, leading to huge computational costs. We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.<n>We present a new advanced VLA architecture derived from Vision-Language-Models (VLM)<n>We show that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval [1.6874375111244329]
State-of-the-art image retrieval systems train specific neural networks for each dataset. Off-the-shelf foundation models fall short in achieving performance comparable to dataset-specific models. We introduce Autoencoders with Strong Variance Constraints (AE-SVC), which significantly improves the performance of foundation models.
arXiv Detail & Related papers (2024-10-09T16:05:16Z)
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z)
PLANRL: A Motion Planning and Imitation Learning Framework to Bootstrap Reinforcement Learning [13.564676246832544]
We introduce PLANRL, a framework that chooses when the robot should use classical motion planning and when it should learn a policy. PLANRL switches between two modes of operation: reaching a waypoint using classical techniques when away from the objects and fine-grained manipulation control when about to interact with objects. We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior performance in terms of adaptability, efficiency, and generalization compared to existing methods.
arXiv Detail & Related papers (2024-08-07T19:30:08Z)
M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension [36.01063804442098]
Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. PETL methods have shown strong performance with fewer tunable parameters. We present M$2$IST: Multi-Modal Interactive Side-Tuning with M$3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters.
arXiv Detail & Related papers (2024-07-01T09:53:53Z)
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba [19.062950348441426]
This work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Our EfficientVMamba integrates a atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Experimental results show that, EfficientVMamba scales down the computational complexity while yields competitive results across a variety of vision tasks.
arXiv Detail & Related papers (2024-03-15T02:48:47Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [27.930351465266515]
We propose a simple yet effective training strategy MoE-Tuning for LVLMs. MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers. Experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE) MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA. Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
MiniVLM: A Smaller and Faster Vision-Language Model [76.35880443015493]
MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module. MiniVLM reduces the model size by $73%$ and the inference time cost by $94%$.
arXiv Detail & Related papers (2020-12-13T03:02:06Z)
Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection [91.43066633305662]
The main purpose of RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information. In this paper, we explore these issues from a new perspective. We implement a kind of more flexible and efficient multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.