Related papers: MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

URL: http://arxiv.org/abs/2505.20152v3
Date: Wed, 01 Oct 2025 03:17:28 GMT
Title: MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Authors: Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li,
Abstract summary: Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences.<n>We propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives.<n>We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving.
Score: 60.20220180316705
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing the training pipeline of vision encoder for fine-grained geometric reasoning tasks. https://github.com/THU-KEG/MMGeoLM.

Related papers

SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM [10.006619357851843]
SupScene is a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for Structure-from-Motion (SfM)<n>Our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters.
arXiv Detail & Related papers (2026-01-17T06:28:47Z)
More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored.<n>We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z)
Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying [7.9925771591348065]
Core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs.<n>In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples.<n>We propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings.
arXiv Detail & Related papers (2025-05-28T11:18:19Z)
Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models [9.682523487279976]
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks.<n>Existing methods primarily fine-tune the model by generating text-based hard negative samples.<n>AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model.
arXiv Detail & Related papers (2025-05-21T14:28:43Z)
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.<n>In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance.<n>We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models [10.443672399225983]
Vision-parametric models (VLMs) have made significant progress in various multimodal tasks. They still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training. We present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library.
arXiv Detail & Related papers (2024-10-17T12:56:52Z)
SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling ( SIGMA) is a novel video pretraining method. We distribute features of space-time tubes evenly across a limited number of learnable clusters. Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [124.68242155098189]
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities. G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z)
Learning Structure-from-Motion with Graph Attention Networks [23.87562683118926]
We tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. In this work we learn a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences.
arXiv Detail & Related papers (2023-08-30T12:13:13Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs) We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles. Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching [26.994954303270575]
Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification. While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances. This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations.
arXiv Detail & Related papers (2023-06-13T17:55:39Z)
Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks. This article introduces BAIT, a practical representation of tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, a truncated max-product Belief propagation, and add what is necessary to make it a proper component of a deep learning model. This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs) The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.