Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
- URL: http://arxiv.org/abs/2412.08737v1
- Date: Wed, 11 Dec 2024 19:12:13 GMT
- Title: Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
- Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
- Abstract summary: This paper introduces Geoperception, a benchmark to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. We develop Euclid, a family of models specifically optimized for strong low-level geometric perception.
- Score: 23.294711275107606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.
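The abstract's "multi-stage training with a data curriculum" can be sketched as a loop that warm-starts each stage on progressively harder synthetic tasks. This is a minimal illustrative sketch, not the paper's actual pipeline; the task names, difficulty scores, and the `train_stage` helper are all hypothetical.

```python
def make_curriculum(tasks):
    """Order synthetic geometry tasks from easy to hard by difficulty score."""
    return sorted(tasks, key=lambda t: t["difficulty"])

def run_curriculum(state, tasks, train_stage):
    """Train in stages, carrying the model state forward between stages."""
    for task in make_curriculum(tasks):
        state = train_stage(state, task)  # warm-start from the previous stage
    return state

# Toy usage: "training" here just records the stage order.
stages = run_curriculum(
    [],
    [{"name": "angle measurement", "difficulty": 3},
     {"name": "point-on-line", "difficulty": 1},
     {"name": "parallel lines", "difficulty": 2}],
    lambda state, task: state + [task["name"]],
)
# stages == ["point-on-line", "parallel lines", "angle measurement"]
```

The key idea the paper reports is that some hard tasks are only learnable when the model is first trained on easier ones, which this staged warm-starting structure captures.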
Related papers
- Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration [57.95306827012784]
We propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams.
By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs.
We train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen.
arXiv Detail & Related papers (2025-04-17T09:13:46Z) - GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning [20.399408869403437]
Geometry problem-solving (GPS) is a challenging task requiring both visual comprehension and symbolic reasoning.
Existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in large language models.
We introduce GeoSense, the first comprehensive bilingual benchmark designed to evaluate the geometric reasoning abilities of MLLMs.
arXiv Detail & Related papers (2025-04-17T02:46:27Z) - MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams [65.02628814094639]
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements.
Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether Multimodal Large Language Models genuinely understand mathematical diagrams beyond superficial pattern recognition.
We introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs.
We construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs annotated with geometric primitives and precise spatial relationships.
arXiv Detail & Related papers (2025-03-26T17:30:41Z) - OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence.
We propose an MLLM (OmniGeo) tailored to geospatial applications.
By combining the strengths of natural language understanding and spatial reasoning, our model enhances instruction-following ability and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z) - Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.
The dataset includes diverse multi-hop questions, including true/false and multiple-choice types, spanning difficulty levels from easy to hard.
We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
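Pointwise V-Information (PVI) compares the log-likelihood a model assigns to a label with the input present versus absent. The sketch below shows only the core quantity; the function name and the toy probabilities are illustrative assumptions, not taken from the paper.

```python
import math

def pointwise_v_information(logp_with_input, logp_without_input):
    """PVI(x -> y) = log2 p_G(y | x) - log2 p_G'(y | null input).

    logp_with_input: natural-log probability of label y from a model
        fine-tuned with inputs (G).
    logp_without_input: natural-log probability of y from a model
        fine-tuned with the input withheld (G').
    Positive PVI means the input makes the label easier to predict;
    near-zero or negative PVI flags examples the model can guess
    without looking at the input.
    """
    return (logp_with_input - logp_without_input) / math.log(2)

# Toy usage with made-up probabilities from the two models.
pvi = pointwise_v_information(math.log(0.5), math.log(0.25))
# pvi == 1.0: the input doubles the probability of the correct answer.
```

In data-selection settings, per-example PVI scores like this are typically used to rank or filter training instances by how much usable information their inputs carry.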
arXiv Detail & Related papers (2025-01-16T16:19:53Z) - GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models [34.647839550142834]
We introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs.
Our evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks.
We show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks.
arXiv Detail & Related papers (2024-12-30T16:01:43Z) - Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning [4.4615747404424395]
Geometry mathematics problems pose significant challenges for large language models (LLMs).
We collect a geometry question-answer dataset, referred to as GeoMath, by sourcing geometric data from Chinese high school education websites.
We propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage (called meta-training) and employs in-context learning (ICL) during inference to improve performance.
arXiv Detail & Related papers (2024-12-12T07:34:09Z) - Personalized Multimodal Large Language Models: A Survey [127.9521218125761]
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications.
arXiv Detail & Related papers (2024-12-03T03:59:03Z) - Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring [34.37450586634531]
This paper presents GPSM4K, a comprehensive geometry multimodal dataset tailored to augment the problem-solving capabilities of Large Vision Language Models (LVLMs). GPSM4K encompasses 2157 multimodal question-answer pairs manually extracted from mathematics textbooks spanning grades 7-12. This dataset serves as an excellent benchmark for assessing the geometric reasoning capabilities of LVLMs.
arXiv Detail & Related papers (2024-12-01T15:19:23Z) - Geometry Distributions [51.4061133324376]
We propose a novel geometric data representation that models geometry as distributions.
Our approach uses diffusion models with a novel network architecture to learn surface point distributions.
We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity.
arXiv Detail & Related papers (2024-11-25T04:06:48Z) - Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver [11.69164802295844]
We introduce a new framework that integrates visual features, geometric formal language, and natural language representations.
We propose a novel synthetic data approach and create a large-scale geometric dataset, SynthGeo228K, annotated with both formal and natural language captions.
Our framework improves MLLMs' ability to process geometric diagrams and extends their application to open-ended tasks on the formalgeo7k dataset.
arXiv Detail & Related papers (2024-09-06T12:11:06Z) - GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models [21.209275651704758]
We focus on the geometric comprehension of Vision Language Models (VLMs).
We benchmark 17 state-of-the-art VLMs using datasets encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios.
Our key insights include detailed analyses of the shortcomings in depth and height reasoning capabilities of VLMs and the inherent bias present in these models.
arXiv Detail & Related papers (2024-08-21T16:16:18Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [124.68242155098189]
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities.
G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z) - Hyperbolic Graph Learning: A Comprehensive Review [56.53820115624101]
This survey paper provides a comprehensive review of the rapidly evolving field of Hyperbolic Graph Learning (HGL). We systematically categorize and analyze existing methods, dividing them into (1) hyperbolic graph embedding-based techniques, (2) graph neural network-based hyperbolic models, and (3) emerging paradigms. We extensively discuss diverse applications of HGL across multiple domains, including recommender systems, knowledge graphs, bioinformatics, and other relevant scenarios.
arXiv Detail & Related papers (2022-02-28T15:08:48Z) - DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image.
Our framework makes inferences based on the rich geometric information of the object in the depth channel alone.
Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z) - Graph Signal Processing for Geometric Data and Beyond: Theory and Applications [55.81966207837108]
Graph Signal Processing (GSP) enables processing signals that reside on irregular domains.
This paper presents GSP methodologies for geometric data in a unified manner, bridging the connections between geometric data and graphs.
It also interprets the operation of recently developed Graph Neural Networks (GNNs) from the perspective of GSP.
arXiv Detail & Related papers (2020-08-05T03:20:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.