MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
- URL: http://arxiv.org/abs/2510.23479v1
- Date: Mon, 27 Oct 2025 16:12:40 GMT
- Title: MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
- Authors: Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
- Abstract summary: MergeMix is a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, retaining the most representative clusters and their spatial context. It then presents a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing via the SimPO loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal but suffers from computational overhead and training instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which retains the most representative clusters and their spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
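The abstract names two ingredients: attention-aware image mixing and SimPO-based preference optimization over mixed/raw image pairs. The sketch below illustrates both in toy form; it is not the paper's implementation, and the attention map, the choice of which response is preferred, and the hyperparameters `beta` and `gamma` are all illustrative assumptions.

```python
import numpy as np

def attention_weighted_mix(img_a, img_b, attn, lam=0.5):
    """Blend two images with a per-pixel mask derived from an attention map
    (a stand-in for the paper's token-merge-based mixing). `attn` is a
    saliency map for img_a in [0, 1]; higher values keep more of img_a."""
    mask = lam * attn
    return mask * img_a + (1.0 - mask) * img_b

def simpo_loss(logp_win, logp_lose, len_win, len_lose, beta=2.0, gamma=0.5):
    """SimPO objective: length-normalized log-likelihood margin between the
    preferred and rejected responses of a preference pair, shifted by a
    target margin gamma and passed through -log(sigmoid(.))."""
    margin = beta * (logp_win / len_win - logp_lose / len_lose) - gamma
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

Note how the loss shrinks as the preferred response's normalized log-likelihood pulls ahead of the rejected one, which is the pressure the preference pairs exert during training.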
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition [10.177623104133023]
We introduce Windowed SummaryMixing (WSM), which enhances SummaryMixing (SM). WSM integrates local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models.
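The entry describes combining a global summary with local neighborhood summaries in place of pairwise attention. A toy version of that mixing step, with the real method's learned projections and nonlinearities omitted and the window size chosen arbitrarily, might look like:

```python
import numpy as np

def windowed_summary_mixing(x, window=4):
    """Linear-time context mixing in the spirit of Windowed SummaryMixing:
    each frame is paired with a local windowed mean and a single global
    mean summary, instead of attending to every other frame. x: (T, d)."""
    T, d = x.shape
    global_summary = x.mean(axis=0)  # one summary vector for the sequence
    local = np.stack([
        x[max(0, t - window): t + window + 1].mean(axis=0)  # neighborhood mean
        for t in range(T)
    ])
    # concatenate per-frame features with local and global context
    return np.concatenate([x, local, np.broadcast_to(global_summary, (T, d))], axis=1)
```

Because every frame only touches a fixed-size window plus one shared summary, the cost grows linearly with sequence length, which is the efficiency property the abstract claims.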
arXiv Detail & Related papers (2026-02-04T06:01:30Z) - MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z) - Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models [0.0]
Multiscale Aggregated Hierarchical Attention (MAHA) is a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention.
arXiv Detail & Related papers (2025-12-16T21:27:21Z) - HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models [50.31704374968706]
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels.
arXiv Detail & Related papers (2025-10-23T08:16:44Z) - Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z) - Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking [18.604455802016233]
Expert Merging is a training-light method that learns a small set of layer-wise coefficients using unlabeled calibration data. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking. Our method surpasses strong training-free and training-based merging baselines.
arXiv Detail & Related papers (2025-09-30T03:16:24Z) - UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. We propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z) - Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning [28.111812077758845]
Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications that involve complex multi-image compositions and multi-modal instructions. We adopt a Reinforcement Learning based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks.
arXiv Detail & Related papers (2025-07-01T13:48:57Z) - Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need [83.10178754323955]
Few-shot learning enables models to generalize from only a few labeled examples. We propose the Unbiased Max-Min Embedding Classification (UMMEC) method, which addresses the key challenges in few-shot learning. Our method significantly improves classification performance with minimal labeled data, advancing the state of the art in transductive few-shot learning.
arXiv Detail & Related papers (2025-03-28T07:23:07Z) - Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models. Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z) - SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [36.74756622715754]
Large language models (LLMs) undergo instruction tuning to acquire instruction-following capabilities. Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning datasets. We propose SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets.
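Several entries in this list (SFTMix, TiMix, PseMix) build on the classic Mixup recipe of interpolating two training examples. A minimal sketch of that shared primitive, with the Beta concentration `alpha` chosen arbitrarily and SFTMix's confidence-based pairing omitted, is:

```python
import numpy as np

def mixup_pair(emb_a, emb_b, alpha=0.2, rng=None):
    """Classic Mixup interpolation (convex combination of two examples'
    embeddings) with the mixing coefficient drawn from Beta(alpha, alpha).
    Labels/targets would be interpolated with the same lam."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam
```

With small `alpha`, the Beta draw concentrates near 0 or 1, so most mixed examples stay close to one of the two sources, a mild regularizer rather than a drastic corruption.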
arXiv Detail & Related papers (2024-10-07T17:52:21Z) - TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z) - Pseudo-Bag Mixup Augmentation for Multiple Instance Learning-Based Whole Slide Image Classification [18.679580844360615]
We propose a new Pseudo-bag Mixup (PseMix) data augmentation scheme to improve the training of MIL models.
Our scheme generalizes the Mixup strategy for general images to special WSIs via pseudo-bags.
It is designed as an efficient and decoupled method, neither involving time-consuming operations nor relying on MIL model predictions.
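The PseMix entry describes mixing at the level of pseudo-bags rather than pixels. A rough sketch of that idea, with the paper's pseudo-bag assignment strategy simplified to a random split and `n_pseudo`/`lam` as illustrative parameters, could be:

```python
import numpy as np

def pseudo_bag_mixup(bag_a, bag_b, n_pseudo=4, lam=0.5, rng=None):
    """Pseudo-bag Mixup sketch for MIL: split each WSI bag of instance
    features (rows) into pseudo-bags, then compose a mixed bag from a
    lam-proportional number of pseudo-bags of each source bag."""
    rng = rng or np.random.default_rng()
    split_a = np.array_split(rng.permutation(bag_a), n_pseudo)
    split_b = np.array_split(rng.permutation(bag_b), n_pseudo)
    k = int(round(lam * n_pseudo))      # pseudo-bags kept from bag_a
    mixed = split_a[:k] + split_b[k:]   # remainder drawn from bag_b
    return np.concatenate(mixed, axis=0)
```

Because the mix happens by swapping whole pseudo-bags of instance features, no model forward pass is needed, which matches the entry's claim of a decoupled scheme that does not rely on MIL model predictions.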
arXiv Detail & Related papers (2023-06-28T13:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.