Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
- URL: http://arxiv.org/abs/2510.07632v1
- Date: Thu, 09 Oct 2025 00:00:49 GMT
- Title: Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
- Authors: Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang
- Abstract summary: We show that widely used evaluation metrics systematically underestimate model capability. We introduce a group matching score that better exploits group structure and reveals substantial hidden capability. We propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision.
- Score: 9.972892886403228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
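The abstract contrasts standard per-pair metrics with a group matching score that exploits group structure. The paper's exact formulation is not reproduced here, but the idea can be sketched for a Winoground-style 2x2 group (two images, two captions, correct pairing on the diagonal): the standard group score demands that every pairwise comparison be won individually, while a matching-based score only asks whether the correct global assignment attains the highest total similarity. The function names and the toy similarity values below are illustrative assumptions, not the authors' implementation.

```python
from itertools import permutations

def standard_group_score(sim):
    """Winoground-style group score on a 2x2 similarity matrix,
    where sim[i][c] is the similarity of image i and caption c and
    the correct pairing is the diagonal (image k <-> caption k).
    Requires winning all four pairwise comparisons."""
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    return text_ok and image_ok

def group_matching_score(sim):
    """Matching-based score (illustrative sketch): the group counts as
    correct iff the identity permutation maximizes the total similarity
    over all caption-to-image assignments."""
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return best == tuple(range(n))

# A group where one pairwise comparison fails, yet the correct global
# pairing still has the highest total similarity: the standard metric
# scores 0 while the matching score credits the model.
sim = [[0.90, 0.80],
       [0.85, 0.84]]
print(standard_group_score(sim))  # False
print(group_matching_score(sim))  # True
```

This illustrates the "hidden capability" claim: a model can rank the correct joint assignment highest even when it loses an isolated pairwise comparison, and overfitting to such induced matchings at test time is what transfers that capability into the standard metrics.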
Related papers
- Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta [0.0]
Classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges. We propose a robust framework that integrates the hybrid CoAtNet architecture with model soups. Our approach achieves state-of-the-art results with 72.36% top-1 accuracy and a 69.28% macro F1-score, outperforming strong baselines.
arXiv Detail & Related papers (2026-03-02T18:50:15Z) - STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z) - Limits and Gains of Test-Time Scaling in Vision-Language Reasoning [8.76012279865596]
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference. We present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source Vision-Language Models (VLMs) on different benchmarks.
arXiv Detail & Related papers (2025-12-11T20:48:54Z) - T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention [0.0]
This work proposes an LSTM-based sentiment classification model with a multi-head attention mechanism and TF-IDF optimization. Experimental results on public datasets demonstrate that the new method achieves substantial improvements on key metrics such as accuracy, recall, and F1-score.
arXiv Detail & Related papers (2025-03-11T06:21:49Z) - TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos [17.41208629642756]
This study proposes TB-Bench, a benchmark to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. When fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks.
arXiv Detail & Related papers (2025-01-10T06:02:06Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z) - Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [50.00235162432848]
We train ALMA models with only 22K parallel sentences and 12M parameters.
The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4.
arXiv Detail & Related papers (2024-01-16T15:04:51Z) - Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series [11.635608108358575]
We introduce Tiny Time Mixers (TTM), a compact model with effective transfer learning capabilities, trained exclusively on public TS datasets.
TTM incorporates innovations like adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle pre-training on varied dataset resolutions.
It outperforms existing popular benchmarks in zero/few-shot forecasting by 4-40%, while significantly reducing computational requirements.
arXiv Detail & Related papers (2024-01-08T15:21:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.