UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
- URL: http://arxiv.org/abs/2508.16239v1
- Date: Fri, 22 Aug 2025 09:20:00 GMT
- Title: UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
- Authors: Nan Wang, Zhiyi Xia, Yiming Li, Shi Tang, Zuxin Fan, Xi Fang, Haoyi Tao, Xiaochen Cai, Guolin Ke, Linfeng Zhang, Yanhui Hong
- Abstract summary: We introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions. A text-to-image diffusion model trained on the entire collection serves as both a powerful data augmentation tool and a proxy for the complete data distribution.
- Score: 19.67541048907923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark -- available on Hugging Face -- will significantly accelerate progress in automated materials analysis.
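The released diffusion model is positioned as a drop-in augmentation tool. As a rough illustration of that workflow, the sketch below samples synthetic micrographs through the Hugging Face diffusers API; the repository id and the prompt are hypothetical placeholders, not the authors' published checkpoint or prompt format.

```python
# Minimal sketch: sampling synthetic EMs from a diffusers-compatible
# text-to-image checkpoint for data augmentation. The repo id below is
# a placeholder; substitute the actual UniEM-3M release once published.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "uniem-3m/em-diffusion",  # hypothetical repo id
    torch_dtype=torch.float16,
).to("cuda")

# Attribute-disentangled prompts mirror the dataset's textual descriptions.
prompt = "SEM micrograph, equiaxed grains, dense packing, high magnification"
images = pipe(prompt, num_inference_steps=30, num_images_per_prompt=4).images
for i, img in enumerate(images):
    img.save(f"synthetic_em_{i}.png")
```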
Related papers
- SAM 3D Body: Robust Full-Body Human Mesh Recovery [65.0108906331903]
We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR). 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape.
arXiv Detail & Related papers (2026-02-17T20:26:37Z) - Large-scale EM Benchmark for Multi-Organelle Instance Segmentation in the Wild [8.670858548670742]
We develop a benchmark for multi-organelle instance segmentation, comprising over 100,000 2D EM images across a variety of cell types and five organelle classes that capture real-world variability. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies. These findings underscore the fundamental mismatch between local-context models and the challenge of modeling long-range structural continuity in the presence of real-world variability.
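The abstract does not spell out the scoring protocol, but instance segmentation benchmarks of this kind are commonly evaluated with COCO-style mask AP. A minimal sketch with pycocotools, assuming COCO-format annotation and prediction files (the file names are placeholders):

```python
# Sketch: COCO-style mask AP evaluation, a standard protocol for
# instance segmentation benchmarks. File paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("organelle_val_annotations.json")  # ground-truth instances
dt = gt.loadRes("model_predictions.json")    # model's predicted masks

evaluator = COCOeval(gt, dt, iouType="segm") # score masks, not boxes
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                        # prints AP, AP50, AP75, ...
```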
arXiv Detail & Related papers (2026-01-18T16:09:27Z) - Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding [0.1411701037241356]
We propose a Multimodal Autoencoder that learns unified representations across text, audio, and visual data. We demonstrate significant improvements in clustering and alignment metrics compared to linear baselines. The results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content-management efficiency in modern broadcasting.
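A minimal PyTorch sketch of such a reconstruction-driven multimodal autoencoder follows; the modality dimensions, fusion by averaging, and MSE objective are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: per-modality encoders share one latent; per-modality decoders
# reconstruct each input from it. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAutoencoder(nn.Module):
    def __init__(self, dims, latent=256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, latent), nn.ReLU()) for m, d in dims.items()}
        )
        self.decoders = nn.ModuleDict({m: nn.Linear(latent, d) for m, d in dims.items()})

    def forward(self, inputs):
        # Fuse by averaging per-modality latents into one shared code.
        z = torch.stack([self.encoders[m](x) for m, x in inputs.items()]).mean(dim=0)
        recons = {m: self.decoders[m](z) for m in inputs}
        loss = sum(F.mse_loss(recons[m], inputs[m]) for m in inputs)
        return z, recons, loss

dims = {"text": 768, "audio": 128, "visual": 2048}
model = MultimodalAutoencoder(dims)
batch = {m: torch.randn(4, d) for m, d in dims.items()}
z, recons, loss = model(batch)  # z: (4, 256); loss drives reconstruction
```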
arXiv Detail & Related papers (2025-11-17T19:13:51Z) - Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework [2.172419551358714]
MultiCrystalSpectrumSet (MCS-Set) is a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision.
arXiv Detail & Related papers (2025-05-30T23:18:42Z) - M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z) - MRGen: Segmentation Data Engine for Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data. This paper investigates leveraging generative models to synthesize data for training segmentation models for underrepresented modalities. We present MRGen, a data engine for controllable medical image synthesis conditioned on text prompts and segmentation masks.
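For intuition, the sketch below approximates mask- and text-conditioned synthesis with a ControlNet-style diffusers pipeline; both checkpoint ids, the prompt, and the file names are placeholders rather than MRGen's released components.

```python
# Sketch of text- and mask-conditioned image synthesis in the spirit of
# MRGen, using diffusers. Checkpoint ids and paths are placeholders.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "org/mask-controlnet", torch_dtype=torch.float16  # hypothetical id
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # generic base model, not MRGen's
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The segmentation mask acts as the spatial conditioning image.
mask = Image.open("target_segmentation_mask.png").convert("RGB")
image = pipe("T2-weighted abdominal MRI, liver highlighted", image=mask).images[0]
image.save("synthetic_mri.png")
```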
arXiv Detail & Related papers (2024-12-04T16:34:22Z) - Revealing the Evolution of Order in Materials Microstructures Using Multi-Modal Computer Vision [4.6481041987538365]
Development of high-performance materials for microelectronics depends on our ability to describe and direct property-defining microstructural order.
Here, we demonstrate a multi-modal machine learning (ML) approach to describe order from electron microscopy analysis of the complex oxide La$_{1-x}$Sr$_x$FeO$_3$.
We observe distinct differences in the performance of uni- and multi-modal models, from which we draw general lessons in describing crystal order using computer vision.
arXiv Detail & Related papers (2024-11-15T02:44:32Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
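As a rough analogue of such a lightweight fusion module, the PyTorch sketch below lets visual tokens cross-attend to text tokens through a single attention layer; the dimensions and residual design are illustrative assumptions, not EMMA's published architecture.

```python
# Sketch: visual tokens query text tokens via one cross-attention layer.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Query = visual, key/value = text; residual keeps visual content.
        fused, _ = self.attn(visual_tokens, text_tokens, text_tokens)
        return self.norm(visual_tokens + fused)

fusion = CrossModalFusion()
v = torch.randn(2, 196, 512)  # e.g. ViT patch tokens
t = torch.randn(2, 32, 512)   # e.g. instruction embeddings
out = fusion(v, t)            # fused visual tokens, shape (2, 196, 512)
```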
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model [11.130574172301365]
Segment Anything Model (SAM) is a large visual model with powerful deep feature representation and zero-shot generalization capabilities.
In this paper, we propose MatSAM, a general and efficient microstructure extraction solution based on SAM.
A simple yet effective point-based prompt generation strategy is designed, grounded on the distribution and shape of microstructures.
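A hedged sketch of that point-prompt idea with the public segment-anything API: crude connected-component centroids stand in for the paper's distribution- and shape-aware strategy, and the checkpoint and image paths are placeholders.

```python
# Sketch: centroid-based point prompts fed to SAM's predictor.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

img = cv2.cvtColor(cv2.imread("micrograph.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(img)

# Crude centroid proposals from Otsu-thresholded connected components.
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
_, _, _, centroids = cv2.connectedComponentsWithStats(binary)

masks = []
for cx, cy in centroids[1:]:  # index 0 is the background component
    m, scores, _ = predictor.predict(
        point_coords=np.array([[cx, cy]]),
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=False,
    )
    masks.append(m[0])
```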
arXiv Detail & Related papers (2024-01-11T03:18:18Z) - Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation [48.267001230607306]
We propose a pretraining framework that enhances multiscale consistency in EM volumes.
Our approach leverages a Siamese network architecture, integrating strong and weak data augmentations.
It effectively captures voxel and feature consistency, showing promise for learning transferable representations for EM analysis.
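A toy PyTorch sketch of the strong/weak Siamese consistency idea follows; the encoder, the noise-based "strong" augmentation, and the cosine objective are illustrative stand-ins for the paper's multiscale design.

```python
# Sketch: weak view provides a stop-gradient target for the strong view.
import torch
import torch.nn as nn
import torch.nn.functional as F

def consistency_loss(encoder, weak_view, strong_view):
    with torch.no_grad():          # weak branch acts as the target
        target = encoder(weak_view)
    pred = encoder(strong_view)    # strong branch receives gradients
    # Cosine feature consistency between the two views.
    return 1 - F.cosine_similarity(pred, target, dim=1).mean()

encoder = nn.Sequential(           # toy stand-in for a 3D EM encoder
    nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
)
vol = torch.randn(2, 1, 32, 64, 64)         # batch of EM sub-volumes
strong = vol + 0.1 * torch.randn_like(vol)  # "strong" augmented view
loss = consistency_loss(encoder, vol, strong)
```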
arXiv Detail & Related papers (2023-08-19T05:49:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.