DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
- URL: http://arxiv.org/abs/2503.19263v3
- Date: Thu, 17 Jul 2025 14:10:06 GMT
- Title: DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
- Authors: Fucai Ke, Vijay Kumar B G, Xingjian Leng, Zhixi Cai, Zaid Khan, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi, Manmohan Chandraker
- Abstract summary: Compositional visual reasoning approaches have shown promise as more effective strategies than end-to-end VR methods. We propose DWIM: Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training, and Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions.
- Score: 57.285435980459205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, such approaches are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency, and the challenge of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
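The abstract's second component, Instruct-Masking fine-tuning, amounts to behavior cloning in which only tokens judged to be effective actions contribute to the loss. The sketch below illustrates that idea as a token-level loss mask in PyTorch; the function name, the masking granularity, and how `action_mask` is constructed are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of instruct-masking as token-level loss masking (assumption-based,
# not the authors' released code): only tokens flagged as effective actions in a
# collected workflow contribute to the cloning loss; prompt tokens and noisy or
# failed tool calls are masked out.
import torch
import torch.nn.functional as F

def instruct_masked_loss(logits: torch.Tensor,
                         labels: torch.Tensor,
                         action_mask: torch.Tensor) -> torch.Tensor:
    """Causal-LM cross-entropy restricted to masked positions.

    logits:      (batch, seq_len, vocab) model outputs
    labels:      (batch, seq_len) target token ids
    action_mask: (batch, seq_len) 1 where the token belongs to an effective
                 action, 0 elsewhere (instructions, failed tool calls, noise)
    """
    # Shift so that position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_mask = action_mask[:, 1:].float()

    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    # Average only over effective-action tokens.
    return (per_token * shift_mask).sum() / shift_mask.sum().clamp(min=1.0)
```

A standard fine-tuning loop would call this in place of the usual unmasked cross-entropy; which actions count as effective is decided upstream by the discrepancy-aware workflow generation step, which this sketch does not model.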
Related papers
- CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control [39.17038025776311]
CARE is a framework designed to train VLA models for robotic task execution. CARE eliminates the need for explicit action labels by leveraging only video-text pairs. Results demonstrate CARE's scalability, interpretability, and effectiveness in robotic control with weak supervision.
arXiv Detail & Related papers (2026-01-30T02:28:32Z) - Solving Context Window Overflow in AI Agents [0.0]
Large Language Models (LLMs) have become increasingly capable of interacting with external tools, granting access to specialized knowledge beyond their training data. Existing solutions such as truncation or summarization fail to preserve complete outputs, making them unsuitable for work requiring the full data. This paper introduces a method that enables LLMs to process and utilize tool responses of arbitrary length without loss of information.
arXiv Detail & Related papers (2025-11-27T19:22:20Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in remote sensing (RS) image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS multi-task learning (MTL). We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models [63.05306474002547]
Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. We introduce AUVIC, a novel visual concept unlearning framework for MLLMs. We show that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
arXiv Detail & Related papers (2025-11-14T13:35:32Z) - MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering [36.80441487363007]
MLLMEraser is an input-aware, training-free framework for test-time unlearning. We construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines.
arXiv Detail & Related papers (2025-10-05T14:20:17Z) - VILOD: A Visual Interactive Labeling Tool for Object Detection [0.0]
This thesis develops and investigates "VILOD: A Visual Interactive Labeling tool for Object Detection". It enables users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for Object Detection. The study showed that the different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories.
arXiv Detail & Related papers (2025-08-29T19:27:10Z) - Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation [36.17444261325021]
Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Existing methods rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. We propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios without requiring VLM fine-tuning.
arXiv Detail & Related papers (2025-06-18T11:43:50Z) - VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [66.84770041828462]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z) - Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation [14.977743061489518]
We introduce Object-Focus Actor (OFA), a novel, data-efficient approach for generalized dexterous manipulation. OFA exploits the consistent end trajectories observed in dexterous manipulation tasks, allowing for efficient policy training. OFA achieves robust performance with only 10 demonstrations, highlighting its data efficiency.
arXiv Detail & Related papers (2025-05-21T04:37:56Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
MLE-Dojo is a Gym-style framework for the systematic reinforcement learning, evaluation, and improvement of autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - Learning Free Token Reduction for Multi-Modal LLM [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks. However, their practical deployment is often constrained by high computational costs and prolonged inference times. We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z) - Visual RAG: Expanding MLLM visual knowledge without fine-tuning [5.341192792319891]
This paper introduces Visual RAG, which synergistically combines the MLLM's capability to learn from the context with a retrieval mechanism.
In this way, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning.
It greatly reduces the computational cost of improving the model's image classification performance, and extends the model's knowledge to new visual domains and tasks it was not trained for.
arXiv Detail & Related papers (2025-01-18T17:43:05Z) - Unified Parameter-Efficient Unlearning for LLMs [25.195126838721492]
Large Language Models (LLMs) have revolutionized natural language processing, enabling advanced understanding and reasoning capabilities across a variety of tasks. This raises significant privacy and security concerns, as models may inadvertently retain and disseminate sensitive or undesirable information. We introduce a novel instance-wise unlearning framework, LLMEraser, which systematically categorizes unlearning tasks and applies precise adjustments using influence functions.
arXiv Detail & Related papers (2024-11-30T07:21:02Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - HRVMamba: High-Resolution Visual State Space Model for Dense Prediction [60.80423207808076]
State Space Models (SSMs) with efficient hardware-aware designs have demonstrated significant potential in computer vision tasks.
These models have been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation.
We introduce the Dynamic Visual State Space (DVSS) block, which employs deformable convolution to mitigate the long-range forgetting problem.
We also introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - VIRL: Volume-Informed Representation Learning towards Few-shot Manufacturability Estimation [0.0]
This work introduces VIRL, a Volume-Informed Representation Learning approach to pre-train a 3D geometric encoder.
The model pre-trained by VIRL demonstrates substantially improved generalizability with limited data.
arXiv Detail & Related papers (2024-06-18T05:30:26Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
We propose MoE-Tuning, a simple yet effective training strategy for LVLMs. MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers. Experiments show the strong performance of MoE-LLaVA on a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z) - Improving Multi-task Learning via Seeking Task-based Flat Regions [38.28600737969538]
Multi-Task Learning (MTL) is a powerful learning paradigm for training deep neural networks that allows learning more than one objective by a single backbone.
There is an emerging line of work in MTL that focuses on manipulating the task gradient to derive an ultimate gradient descent direction.
We propose to leverage a recently introduced training method, named Sharpness-aware Minimization, which can enhance model generalization ability on single-task learning.
arXiv Detail & Related papers (2022-11-24T17:19:30Z) - Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)