RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
- URL: http://arxiv.org/abs/2602.12892v1
- Date: Fri, 13 Feb 2026 12:56:31 GMT
- Title: RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
- Authors: Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han, Guowei Huang, Xingyue Quan, Hang Xu, Bokui Chen, Xiaodan Liang,
- Abstract summary: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training.<n>Current evaluation relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs.<n>We propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training.
- Score: 59.493415006017635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.
Related papers
- Towards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness [30.596695293390415]
Interpolated Approximate Measurement (IAM) is a framework designed for unlearning inference.<n>IAM quantifies sample-level unlearning completeness by interpolating the model's generalization-fitting behavior gap on queried samples.<n>We apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning.
arXiv Detail & Related papers (2025-06-06T14:22:18Z) - Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches [46.0474342507327]
We introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique.<n>Our method evaluates a model's multiple abilities to teach weaker student models to perform tasks effectively.
arXiv Detail & Related papers (2025-05-18T06:51:10Z) - Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective [7.408649506385476]
The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance.<n>Current prediction methods lack accuracy and reliability.<n>We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction.
arXiv Detail & Related papers (2025-02-24T15:44:57Z) - Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training.<n>We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons [9.954960702259918]
This paper introduces Themis, a fine-tuned large language model (LLMs) judge that delivers context-aware evaluations.<n>We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts.<n>We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner.
arXiv Detail & Related papers (2025-02-05T08:35:55Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs)
MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.