Related papers: ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios

URL: http://arxiv.org/abs/2601.03011v1
Date: Tue, 06 Jan 2026 13:36:43 GMT
Title: ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios
Authors: Yihan Wei, Shenghai Yuan, Tianchen Deng, Boyang Lou, Enwen Hu,
Abstract summary: We present ReCCur, a framework that converts noisy web imagery into auditable fine-grained labels. On realistic corner-case scenarios, ReCCur runs on consumer-grade GPUs and steadily improves purity and separability.
Score: 14.85600144047706
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.
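
The second stage in the abstract (kNN voting over complementary encoders, gated by two confidences, with uncertainty sampling) can be illustrated with a minimal sketch. The abstract does not define "dual-confidence activation", so the two gates used here (mean neighbour-vote fraction and cross-encoder agreement), the thresholds `tau_vote` and `tau_agree`, and all function names are assumptions for illustration, not the authors' method. Embeddings are plain NumPy arrays standing in for CLIP, DINOv2, or BEiT features.

```python
import numpy as np

def knn_vote(train_emb, train_labels, query_emb, k, n_classes):
    """Per-class vote fractions for each query from its k nearest
    labelled neighbours under cosine similarity."""
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = b @ a.T                                # (n_query, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]     # k most similar items
    votes = np.zeros((query_emb.shape[0], n_classes))
    for c in range(n_classes):
        votes[:, c] = (train_labels[nn_idx] == c).mean(axis=1)
    return votes

def multi_expert_pseudo_labels(experts, train_labels, k=3, n_classes=2,
                               tau_vote=0.8, tau_agree=1.0):
    """experts: list of (train_emb, query_emb) pairs, one per encoder.

    Hypothetical gating: a pseudo-label is accepted only if
      (1) the averaged neighbour-vote fraction reaches tau_vote, and
      (2) the fraction of encoders agreeing with it reaches tau_agree.
    Rejected queries are ordered by ascending vote confidence so the
    least certain ones are reviewed first (uncertainty sampling).
    """
    per_expert = np.stack([
        knn_vote(tr, train_labels, q, k, n_classes) for tr, q in experts
    ])                                            # (n_experts, n_query, n_classes)
    mean_votes = per_expert.mean(axis=0)
    labels = mean_votes.argmax(axis=1)
    vote_conf = mean_votes.max(axis=1)
    expert_preds = per_expert.argmax(axis=2)      # (n_experts, n_query)
    agree_conf = (expert_preds == labels[None, :]).mean(axis=0)
    accepted = (vote_conf >= tau_vote) & (agree_conf >= tau_agree)
    rejected = np.where(~accepted)[0]
    review_queue = rejected[np.argsort(vote_conf[rejected])]
    return labels, accepted, review_queue
```

Requiring agreement across encoders is one simple way to trade recall for precision: a sample that only one feature space finds convincing is deferred to the review queue rather than labelled automatically.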

Related papers

Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z)
Footprint-Guided Exemplar-Free Continual Histopathology Report Generation [3.361593315894868]
We introduce an exemplar-free continual learning framework for WSI-to-report generation. The core idea is a compact domain footprint built in a frozen patch-embedding space. Our approach outperforms exemplar-free and limited-buffer rehearsal baselines.
arXiv Detail & Related papers (2026-02-27T08:58:03Z)
ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z)
Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z)
PEARL: Prototype-Enhanced Alignment for Label-Efficient Representation Learning with Deployment-Driven Insights from Digital Governance Communication Systems [7.027521313133687]
We propose PEARL, a label-efficient approach that uses limited supervision to softly align embeddings toward class prototypes. We evaluate PEARL under controlled label regimes ranging from extreme label scarcity to higher-label settings. In the label-scarce condition, PEARL substantially improves local neighborhood quality, yielding 25.7% gains over raw embeddings and more than 21.1% gains relative to strong unsupervised post-processing.
arXiv Detail & Related papers (2026-01-24T15:46:02Z)
Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models [10.230967860299504]
We propose a zero-shot-oriented inspection framework that integrates Retrieval-Augmented Generation with Vision-Language Models. A multimodal knowledge base is constructed, comprising technical documentation, representative reference images, and domain-specific guidelines. We evaluate the framework on 30 labeled blade images covering diverse damage categories.
arXiv Detail & Related papers (2025-10-26T23:19:28Z)
EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels [85.78886153628663]
Open-Set Domain Generalization aims to enable deep learning models to recognize unseen categories in new domains. Label noise hinders open-set domain generalization by corrupting source-domain knowledge. We propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM) to bridge domain gaps.
arXiv Detail & Related papers (2025-10-14T16:23:11Z)
Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry [5.1511135538176]
Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. We propose Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision.
arXiv Detail & Related papers (2025-10-10T17:50:31Z)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings [65.31723739561151]
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics. We introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. We finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model.
arXiv Detail & Related papers (2025-06-10T09:00:33Z)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture. It integrates implicit and explicit modeling approaches within a two-stage framework. It outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
Progressive Learning with Cross-Window Consistency for Semi-Supervised Semantic Segmentation [40.00721341952556]
Cross-window consistency (CWC) is helpful in comprehensively extracting auxiliary supervision from unlabeled data. We propose a novel CWC-driven progressive learning framework to optimize the deep network by mining weak-to-strong constraints from massive unlabeled data. In addition, we propose a dynamic pseudo-label memory bank (DPM) to provide high-consistency and high-reliability pseudo-labels.
arXiv Detail & Related papers (2022-11-22T17:31:43Z)
Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations. DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals. We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
arXiv Detail & Related papers (2022-11-12T09:21:49Z)
Dense Label Encoding for Boundary Discontinuity Free Rotation Detection [69.75559390700887]
This paper explores a relatively less-studied methodology based on classification. We propose new techniques to push its frontier in two aspects. Experiments and visual analysis on large-scale public datasets for aerial images show the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-19T05:42:02Z)
Unsupervised Metric Relocalization Using Transform Consistency Loss [66.19479868638925]
Training networks to perform metric relocalization traditionally requires accurate image correspondences. We propose a self-supervised solution, which exploits a key insight: localizing a query image within a map should yield the same absolute pose, regardless of the reference image used for registration. We evaluate our framework on synthetic and real-world data, showing our approach outperforms other supervised methods when a limited amount of ground-truth information is available.
arXiv Detail & Related papers (2020-11-01T19:24:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.