Related papers: Hybrid Visual Servoing of Tendon-driven Continuum Robots

ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection [15.419663374345845]
This paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy. To maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm.
arXiv Detail & Related papers (2026-01-08T08:56:07Z)
SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding [25.2227348401136]
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding. We show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion.
arXiv Detail & Related papers (2025-12-16T04:12:52Z)
$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion [65.77755100137728]
We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
arXiv Detail & Related papers (2025-11-26T16:14:20Z)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation [27.007611140797852]
Existing methods optimize inference speed by reducing visual redundancy within VLA models. We propose Action-aware Dynamic Pruning (ADP), a multi-modal pruning framework that integrates text-driven token selection with action-aware trajectory gating.
arXiv Detail & Related papers (2025-09-26T09:13:02Z)
Refine-and-Contrast: Adaptive Instance-Aware BEV Representations for Multi-UAV Collaborative Object Detection [15.494912154439367]
Multi-UAV collaborative 3D detection enables accurate and robust perception by fusing multi-view observations from aerial platforms. We present AdaBEV, a novel framework that learns adaptive instance-aware BEV representations through a refine-and-contrast paradigm.
arXiv Detail & Related papers (2025-08-18T07:37:14Z)
SCALAR: Scale-wise Controllable Visual Autoregressive Learning [15.775596699630633]
We present SCALAR, a controllable generation method based on Visual Autoregressive (VAR) modeling. We leverage a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model.
arXiv Detail & Related papers (2025-07-26T13:23:08Z)
SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [69.54069477520534]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z)
CVVNet: A Cross-Vertical-View Network for Gait Recognition [3.9124245851778032]
We propose CVVNet, a frequency aggregation architecture for robust cross-vertical-view gait recognition. CVVNet achieves state-of-the-art performance, with 8.6% improvement on DroneGait and 2% on Gait3D.
arXiv Detail & Related papers (2025-05-03T14:53:20Z)
Enhancing Variational Autoencoders with Smooth Robust Latent Encoding [54.74721202894622]
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models. We introduce Smooth Robust Latent VAE, a novel adversarial training framework that boosts both generation quality and robustness. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks.
arXiv Detail & Related papers (2025-04-24T03:17:57Z)
Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [24.1236728596359]
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. We propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations.
arXiv Detail & Related papers (2025-03-04T06:12:08Z)
VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression [59.14355576912495]
NeRF-based video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences. The substantial data volumes pose significant challenges for storage and transmission. We propose VRVVC, a novel end-to-end joint variable-rate framework for video compression.
arXiv Detail & Related papers (2024-12-16T01:28:04Z)
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [12.373320641721344]
Large Vision-Language-Action (VLA) models have shown promise in robotic control due to their impressive generalization ability. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency. This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off.
arXiv Detail & Related papers (2024-09-12T09:18:09Z)
Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise. One critical aspect is formulating a consistency constraint specifically for temporal-spatial illumination and appearance enhanced versions. We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
HSIDMamba: Exploring Bidirectional State-Space Models for Hyperspectral Denoising [11.022546457796949]
We propose HSIDMamba (HSDM), tailored to exploit the linear complexity for effectively capturing spatial-spectral dependencies in HSI denoising. HSDM comprises multiple Hyperspectral Continuous Scan Blocks, incorporating BCSM (Bidirectional Continuous Scanning Mechanism), scale residual, and spectral attention mechanisms. BCSM strengthens spatial-spectral interactions by linking forward and backward scans and enhancing information from eight directions through SSM.
arXiv Detail & Related papers (2024-04-15T11:59:19Z)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field. Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
VIBR: Learning View-Invariant Value Functions for Robust Visual Control [3.2307366446033945]
VIBR (View-Invariant Bellman Residuals) is a method that combines multi-view training and invariant prediction to reduce the out-of-distribution gap for RL-based visuomotor control. We show that VIBR outperforms existing methods on complex visuomotor control environments with high visual perturbation.
arXiv Detail & Related papers (2023-06-14T14:37:34Z)
Efficient Image Super-Resolution with Feature Interaction Weighted Hybrid Network [101.53907377000445]
Lightweight image super-resolution aims to reconstruct high-resolution images from low-resolution images using low computational costs. Existing methods result in the loss of middle-layer features due to activation functions. We propose a Feature Interaction Weighted Hybrid Network (FIWHN) to minimize the impact of intermediate feature loss on reconstruction quality.
arXiv Detail & Related papers (2022-12-29T05:57:29Z)
HDNet: High-resolution Dual-domain Learning for Spectral Compressive Imaging [138.04956118993934]
We propose a high-resolution dual-domain learning network (HDNet) for HSI reconstruction. On the one hand, the proposed HR spatial-spectral attention module with its efficient feature fusion provides continuous and fine pixel-level features. On the other hand, frequency domain learning (FDL) is introduced for HSI reconstruction to narrow the frequency domain discrepancy.
arXiv Detail & Related papers (2022-03-04T06:37:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.