MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
- URL: http://arxiv.org/abs/2510.23479v1
- Date: Mon, 27 Oct 2025 16:12:40 GMT
- Title: MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
- Authors: Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
- Abstract summary: MergeMix is a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, retaining the most representative clusters and their spatial context. It then presents a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing via the SimPO loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal but suffers from computational overhead and training instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which retains the most representative clusters and their spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
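The abstract names two ingredients: attention-aware image mixing and SimPO-based preference optimization over mixed/raw image pairs. The sketch below illustrates both in toy form; it is not the paper's implementation, and the attention map, the choice of which response is preferred, and the hyperparameters `beta` and `gamma` are all illustrative assumptions.

```python
import numpy as np

def attention_weighted_mix(img_a, img_b, attn, lam=0.5):
    """Blend two images with a per-pixel mask derived from an attention map
    (a stand-in for the paper's token-merge-based mixing). `attn` is a
    saliency map for img_a in [0, 1]; higher values keep more of img_a."""
    mask = lam * attn
    return mask * img_a + (1.0 - mask) * img_b

def simpo_loss(logp_win, logp_lose, len_win, len_lose, beta=2.0, gamma=0.5):
    """SimPO objective: length-normalized log-likelihood margin between the
    preferred and rejected responses of a preference pair, shifted by a
    target margin gamma and passed through -log(sigmoid(.))."""
    margin = beta * (logp_win / len_win - logp_lose / len_lose) - gamma
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

Note how the loss shrinks as the preferred response's normalized log-likelihood pulls ahead of the rejected one, which is the pressure the preference pairs exert during training.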
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition [10.177623104133023]
We introduce Windowed SummaryMixing (WSM), which enhances SummaryMixing (SM). WSM integrates local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models.
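The entry describes combining a global summary with local neighborhood summaries in place of pairwise attention. A toy version of that mixing step, with the real method's learned projections and nonlinearities omitted and the window size chosen arbitrarily, might look like:

```python
import numpy as np

def windowed_summary_mixing(x, window=4):
    """Linear-time context mixing in the spirit of Windowed SummaryMixing:
    each frame is paired with a local windowed mean and a single global
    mean summary, instead of attending to every other frame. x: (T, d)."""
    T, d = x.shape
    global_summary = x.mean(axis=0)  # one summary vector for the sequence
    local = np.stack([
        x[max(0, t - window): t + window + 1].mean(axis=0)  # neighborhood mean
        for t in range(T)
    ])
    # concatenate per-frame features with local and global context
    return np.concatenate([x, local, np.broadcast_to(global_summary, (T, d))], axis=1)
```

Because every frame only touches a fixed-size window plus one shared summary, the cost grows linearly with sequence length, which is the efficiency property the abstract claims.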
arXiv Detail & Related papers (2026-02-04T06:01:30Z) - MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z) - Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models [0.0]
Multiscale Aggregated Hierarchical Attention (MAHA) is a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention.
arXiv Detail & Related papers (2025-12-16T21:27:21Z) - HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models [50.31704374968706]
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels.
arXiv Detail & Related papers (2025-10-23T08:16:44Z) - Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z) - Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking [18.604455802016233]
Expert Merging is a training-light method that learns a small set of layer-wise coefficients using unlabeled calibration data. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking. Our method surpasses strong training-free and training-based merging baselines.
arXiv Detail & Related papers (2025-09-30T03:16:24Z) - UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. We propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z) - Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning [28.111812077758845]
Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications that involve complex multi-image compositions and multi-modal instructions. We adopt a Reinforcement Learning based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks.
arXiv Detail & Related papers (2025-07-01T13:48:57Z) - Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need [83.10178754323955]
Few-shot learning enables models to generalize from only a few labeled examples. We propose the Unbiased Max-Min Embedding Classification (UMMEC) method, which addresses the key challenges in few-shot learning. Our method significantly improves classification performance with minimal labeled data, advancing the state of the art in transductive few-shot learning.
arXiv Detail & Related papers (2025-03-28T07:23:07Z) - Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models. Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z) - SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [36.74756622715754]
Large language models (LLMs) undergo instruction tuning to acquire instruction-following capabilities. Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning datasets. We propose SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets.
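Several entries in this list (SFTMix, TiMix, PseMix) build on the classic Mixup recipe of interpolating two training examples. A minimal sketch of that shared primitive, with the Beta concentration `alpha` chosen arbitrarily and SFTMix's confidence-based pairing omitted, is:

```python
import numpy as np

def mixup_pair(emb_a, emb_b, alpha=0.2, rng=None):
    """Classic Mixup interpolation (convex combination of two examples'
    embeddings) with the mixing coefficient drawn from Beta(alpha, alpha).
    Labels/targets would be interpolated with the same lam."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam
```

With small `alpha`, the Beta draw concentrates near 0 or 1, so most mixed examples stay close to one of the two sources, a mild regularizer rather than a drastic corruption.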
arXiv Detail & Related papers (2024-10-07T17:52:21Z) - TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z) - Pseudo-Bag Mixup Augmentation for Multiple Instance Learning-Based Whole Slide Image Classification [18.679580844360615]
We propose a new Pseudo-bag Mixup (PseMix) data augmentation scheme to improve the training of MIL models.
Our scheme generalizes the Mixup strategy for general images to special WSIs via pseudo-bags.
It is designed as an efficient and decoupled method, neither involving time-consuming operations nor relying on MIL model predictions.
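The PseMix entry describes mixing at the level of pseudo-bags rather than pixels. A rough sketch of that idea, with the paper's pseudo-bag assignment strategy simplified to a random split and `n_pseudo`/`lam` as illustrative parameters, could be:

```python
import numpy as np

def pseudo_bag_mixup(bag_a, bag_b, n_pseudo=4, lam=0.5, rng=None):
    """Pseudo-bag Mixup sketch for MIL: split each WSI bag of instance
    features (rows) into pseudo-bags, then compose a mixed bag from a
    lam-proportional number of pseudo-bags of each source bag."""
    rng = rng or np.random.default_rng()
    split_a = np.array_split(rng.permutation(bag_a), n_pseudo)
    split_b = np.array_split(rng.permutation(bag_b), n_pseudo)
    k = int(round(lam * n_pseudo))      # pseudo-bags kept from bag_a
    mixed = split_a[:k] + split_b[k:]   # remainder drawn from bag_b
    return np.concatenate(mixed, axis=0)
```

Because the mix happens by swapping whole pseudo-bags of instance features, no model forward pass is needed, which matches the entry's claim of a decoupled scheme that does not rely on MIL model predictions.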
arXiv Detail & Related papers (2023-06-28T13:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.