LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction
- URL: http://arxiv.org/abs/2508.11153v1
- Date: Fri, 15 Aug 2025 01:49:58 GMT
- Title: LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction
- Authors: Maoquan Zhang, Bisser Raytchev, Xiujuan Sun
- Abstract summary: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It is the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding. The code and dataset will be released to facilitate future research and practical deployment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom's taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.
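The authors state that code and data are forthcoming, so nothing below is their implementation. As a rough, hypothetical illustration of what layout-conditioned generation can look like in practice, this sketch rasterizes a two-panel narrative layout into a conditioning image and feeds it to an off-the-shelf ControlNet pipeline from Hugging Face diffusers; the checkpoint names, the box-to-image rendering, and the prompt are all assumptions.

```python
# Hypothetical sketch of layout-conditioned generation in the spirit of LEARN.
# Checkpoints and the layout-rendering step are illustrative assumptions; the
# authors' code is not yet released.
import torch
from PIL import Image, ImageDraw
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def render_layout(boxes, size=(512, 512)):
    """Rasterize narrative layout boxes (x0, y0, x1, y1) into a conditioning image."""
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    for box in boxes:
        draw.rectangle(box, outline="white", width=4)
    return canvas

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Two side-by-side panels for a sequential concept (step 1 -> step 2).
layout = render_layout([(32, 96, 240, 416), (272, 96, 480, 416)])
image = pipe(
    "a clear two-panel textbook diagram of mitosis, labeled stages, clean lines",
    image=layout,
    num_inference_steps=30,
).images[0]
image.save("learn_sketch.png")
```

Rendering the layout as a conditioning image is only one plausible reading of "layout-conditioned generation"; LEARN may instead condition on layout tokens or region embeddings.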
Related papers
- CLLMRec: LLM-powered Cognitive-Aware Concept Recommendation via Semantic Alignment and Prerequisite Knowledge Distillation [3.200298153814017]
The growth of Massive Open Online Courses (MOOCs) presents significant challenges for personalized learning, where concept recommendation is crucial. Existing approaches typically rely on heterogeneous information networks or knowledge graphs to capture conceptual relationships, combined with knowledge tracing models to assess learners' cognitive states. This paper proposes CLLMRec, a novel framework that leverages Large Language Models to generate personalized concept recommendations.
arXiv Detail & Related papers (2025-11-21T08:37:39Z)
- Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models [0.5352699766206808]
Diagrams are crucial yet underexplored tools in many disciplines. Their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital captions. Vision-language models (VLMs) can help us identify and interpret such hybrid pages in context.
arXiv Detail & Related papers (2025-11-17T13:52:23Z)
- Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts [1.1883838320818292]
We propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. An attention module then draws on the semantic knowledge of relevant visual concepts to produce class-representative fused features for classification.
arXiv Detail & Related papers (2025-08-05T05:15:54Z)
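The similarity-based filtering this entry mentions is straightforward to illustrate: a new LLM-generated concept embedding joins the pool only if its cosine similarity to every stored embedding stays below a threshold. The class, threshold, and dimensions below are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of similarity-based filtering for a visual concept pool.
# Threshold and embedding size are assumptions for illustration only.
import torch
import torch.nn.functional as F

class ConceptPool:
    def __init__(self, dim: int, max_cosine: float = 0.9):
        self.embeddings = torch.empty(0, dim)  # stored (normalized) concepts
        self.max_cosine = max_cosine           # redundancy threshold

    def add(self, concept: torch.Tensor) -> bool:
        """Add a concept embedding unless it is redundant with the pool."""
        concept = F.normalize(concept, dim=-1)
        if len(self.embeddings) > 0:
            sims = self.embeddings @ concept   # cosine similarity to each entry
            if sims.max() > self.max_cosine:
                return False                   # too close to an existing concept
        self.embeddings = torch.cat([self.embeddings, concept[None]])
        return True

pool = ConceptPool(dim=512)
pool.add(torch.randn(512))           # accepted: pool is empty
print(pool.add(pool.embeddings[0]))  # False: filtered as redundant
```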
- Embryology of a Language Model [1.1874560263468232]
In this work, we introduce an embryological approach, applying UMAP to the susceptibility matrix to visualize the model's structural development over training. Our visualizations reveal the emergence of a clear "body plan," charting the formation of known features like the induction circuit and discovering previously unknown structures.
arXiv Detail & Related papers (2025-08-01T05:39:41Z)
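For context, the visualization step this entry describes can be outlined with the umap-learn package. The susceptibility matrix is specific to the paper; here it is stood in for by random data, with rows assumed to index training checkpoints.

```python
# Minimal sketch of UMAP over a susceptibility matrix (placeholder data).
# The real matrix and its row/column semantics come from the paper.
import numpy as np
import umap  # pip install umap-learn

susceptibility = np.random.rand(200, 1024)  # placeholder: 200 checkpoints x 1024 features

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(susceptibility)  # one 2-D point per row

print(coords.shape)  # (200, 2); scatter over training time to chart the "body plan"
```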
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning. CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. We introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z)
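For readers unfamiliar with the objective SmartCLIP builds on, here is a generic sketch of CLIP's symmetric contrastive (InfoNCE) loss. It shows the standard alignment mechanism only, not SmartCLIP's modular identification method.

```python
# Generic CLIP-style symmetric InfoNCE loss, shown for background.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim); row i of each is a matched pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits))            # diagonal entries are positives
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```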
- Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning [58.73625654718187]
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. Existing approaches fine-tune the visual backbone on seen-class data to obtain semantic-related visual features. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation.
arXiv Detail & Related papers (2025-03-29T10:17:57Z)
- Emergent Visual-Semantic Hierarchies in Image-Text Representations [13.300199242824934]
We study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding.
arXiv Detail & Related papers (2024-07-11T14:09:42Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
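The mechanism the entry above describes, a visual-context embedding injected into the text conditioning, can be sketched as follows. Every module choice here is an illustrative assumption rather than the paper's architecture, which additionally pairs this guidance with a ControlNet branch.

```python
# Toy sketch: project a pooled visual-context embedding into the text-token
# space and write it into a reserved slot of the text conditioning.
# Dimensions and the slot convention are assumptions for illustration.
import torch
import torch.nn as nn

class VisualContextModulator(nn.Module):
    def __init__(self, vision_dim: int = 768, text_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)  # image space -> token space

    def forward(self, text_tokens, context_emb):
        """text_tokens: (batch, seq, text_dim); context_emb: (batch, vision_dim)."""
        visual_token = self.proj(context_emb)  # (batch, text_dim)
        modulated = text_tokens.clone()
        modulated[:, 0] = visual_token         # slot 0 reserved for visual context
        return modulated                       # feed to cross-attention as usual

mod = VisualContextModulator()
out = mod(torch.randn(2, 77, 768), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 77, 768])
```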
- A Message Passing Perspective on Learning Dynamics of Contrastive Learning [60.217972614379065]
We show that if we cast a contrastive objective equivalently into the feature space, then its learning dynamics admits an interpretable form.
This perspective also establishes an intriguing connection between contrastive learning and Message Passing Graph Neural Networks (MP-GNNs).
arXiv Detail & Related papers (2023-03-08T08:27:31Z)
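To make the feature-space view concrete, consider the standard InfoNCE loss for sample $i$; its gradient with respect to the feature $f_i$ already has the aggregation structure of a message-passing update (the paper's exact derivation may differ):

```latex
% Illustrative: standard InfoNCE and its per-sample feature gradient.
\[
  \mathcal{L}_i = -\log
  \frac{\exp(f_i^{\top} f_{i^{+}} / \tau)}{\sum_{j} \exp(f_i^{\top} f_j / \tau)},
  \qquad
  \nabla_{f_i} \mathcal{L}_i
  = \frac{1}{\tau}\Big( \sum_{j} p_{ij}\, f_j - f_{i^{+}} \Big),
  \quad
  p_{ij} = \frac{\exp(f_i^{\top} f_j / \tau)}{\sum_{k} \exp(f_i^{\top} f_k / \tau)}.
\]
```

Each gradient step therefore pulls $f_i$ toward its positive $f_{i^{+}}$ and away from a softmax-weighted average of the other features, i.e., messages aggregated over the positive-pair and negative-sample graphs.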
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model cross-modal alignment via the similarity of global image and text representations.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling [39.59158974352266]
Visual storytelling aims at generating an imaginary and coherent story, comprising multiple narrative sentences, from a group of relevant images.
Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images.
To address these problems, a novel knowledge-enriched attention network with group-wise semantic model is proposed.
arXiv Detail & Related papers (2022-03-10T12:55:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.