Enhancing Textbooks with Visuals from the Web for Improved Learning
- URL: http://arxiv.org/abs/2304.08931v2
- Date: Fri, 20 Oct 2023 11:06:14 GMT
- Title: Enhancing Textbooks with Visuals from the Web for Improved Learning
- Authors: Janvijay Singh, Vilém Zouhar, Mrinmaya Sachan
- Abstract summary: In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web.
We collect a dataset of e-textbooks in the math, science, social science and business domains.
We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks.
- Score: 50.01434477801967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textbooks are one of the main mediums for delivering high-quality education
to students. In particular, explanatory and illustrative visuals play a key
role in retention, comprehension and general transfer of knowledge. However,
many textbooks lack these interesting visuals to support student learning. In
this paper, we investigate the effectiveness of vision-language models to
automatically enhance textbooks with images from the web. We collect a dataset
of e-textbooks in the math, science, social science and business domains. We
then set up a text-image matching task that involves retrieving and
appropriately assigning web images to textbooks, which we frame as a matching
optimization problem. Through a crowd-sourced evaluation, we verify that (1)
while the original textbook images are rated higher, automatically assigned
ones are not far behind, and (2) the precise formulation of the optimization
problem matters. We release the dataset of textbooks with an associated image
bank to inspire further research in this intersectional area of computer vision
and NLP for education.
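The abstract frames image assignment as a matching optimization problem. Below is a minimal sketch of one such formulation, assuming precomputed text and image embeddings; the Hungarian-algorithm solver and cosine scoring are illustrative choices, not necessarily the authors' exact pipeline.
```python
# Sketch: assign retrieved web images to textbook sections by solving a
# linear assignment problem over text-image similarity scores.
# Embeddings are assumed precomputed (e.g., by a vision-language model);
# the solver and scoring here are illustrative, not the authors' pipeline.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_images(section_embs: np.ndarray, image_embs: np.ndarray):
    """section_embs: (S, d) section text embeddings; image_embs: (I, d)."""
    # Cosine similarity between every section and every candidate image.
    s = section_embs / np.linalg.norm(section_embs, axis=1, keepdims=True)
    i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sim = s @ i.T  # (S, I)
    # The Hungarian algorithm maximizes total similarity under a
    # one-to-one constraint (each image used for at most one section).
    rows, cols = linear_sum_assignment(-sim)  # negate: SciPy minimizes cost
    return list(zip(rows.tolist(), cols.tolist()))
```
Since the abstract notes that the precise formulation of the optimization problem matters, variants such as relaxing the one-to-one constraint or adding a minimum-similarity threshold would change the assignments this sketch produces.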
Related papers
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining [86.76706820098867]
We introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining.
It collects over 2.5 years of instructional videos, totaling 22,000 class hours.
Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment.
arXiv Detail & Related papers (2025-01-01T21:29:37Z)
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data (a sketch of such an encoding setup follows below).
arXiv Detail & Related papers (2024-05-31T15:17:47Z)
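A minimal sketch of the encoding setup described above, assuming Hugging Face checkpoints; the specific models "openai/clip-vit-base-patch32" and "all-MiniLM-L6-v2" (an MTEB-listed model) are illustrative stand-ins, not confirmed by the paper.
```python
# Sketch of the described setup: CLIP for image encoding plus a separate
# MTEB-style text-embedding model. Checkpoints and paths are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # an MTEB-listed model

image = Image.open("page_screenshot.png")  # hypothetical input file
with torch.no_grad():
    pixels = processor(images=image, return_tensors="pt")
    image_emb = clip.get_image_features(**pixels)  # CLIP image-space vector

passage_emb = text_encoder.encode(["Dense text extracted from the page."])
# Note: the two embedding spaces are not aligned; an application would
# typically retrieve with each encoder separately or learn a projection.
```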
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning approach to learning cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder, and sentence-level dense vector representations using an LSTM-based text autoencoder (a simplified sketch follows below).
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
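A compact sketch of that idea: a per-modality autoencoder whose latent codes are pulled into a shared space by an alignment loss. The modules below are simplified stand-ins for the StackGAN-based and LSTM-based models named in the summary, not the paper's architecture.
```python
# Sketch: align the latents of an image autoencoder and an LSTM text
# autoencoder into one cross-modal embedding space. Simplified stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAutoencoder(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):                    # tokens: (B, T) token ids
        x = self.embed(tokens)
        _, (h, _) = self.encoder(x)               # h: (1, B, hidden)
        z = h[-1]                                 # sentence latent (B, hidden)
        dec_in = z.unsqueeze(1).repeat(1, tokens.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), z               # reconstruction logits, latent

def alignment_loss(z_img, z_txt):
    # Pull matched image/text latents together in the shared space;
    # z_img would come from the image autoencoder's bottleneck.
    return 1 - F.cosine_similarity(z_img, z_txt).mean()
```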
- LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning (see the loss sketch below).
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z)
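The auxiliary-task setup can be read as a weighted multi-task objective. A minimal sketch, with loss names and weights as assumptions rather than the paper's exact formulation:
```python
# Sketch of a LAViTeR-style multi-task objective: the main visual-textual
# alignment (VTA) loss assisted by GAN-based image-synthesis and image-
# captioning losses. The weights are illustrative assumptions.
def multitask_loss(vta_loss, gan_loss, caption_loss, w_gan=0.1, w_cap=0.5):
    return vta_loss + w_gan * gan_loss + w_cap * caption_loss
```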
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach for PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
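To make the graph-reasoning step of the last entry concrete, here is a generic sketch of a GCN layer over nodes that mix salient-object and scene-text features. It is not the paper's exact architecture, and all shapes and the toy graph are illustrative.
```python
# Minimal GCN layer over a multi-modal graph: nodes carry either visual
# (salient-object) or textual (scene-text) features projected into one
# space. Generic sketch, not the paper's exact architecture.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) adjacency with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = adj @ x / deg                  # mean aggregation over neighbors
        return torch.relu(self.linear(h))  # relationship-enhanced features

# Usage: project visual and OCR-text features to a common dimension, stack
# them as nodes, and connect co-occurring objects and words in `adj`.
visual = torch.randn(5, 256)   # 5 salient objects (hypothetical)
textual = torch.randn(3, 256)  # 3 scene-text tokens (hypothetical)
nodes = torch.cat([visual, textual], dim=0)
adj = torch.ones(8, 8)         # fully connected toy graph
enhanced = GCNLayer(256)(nodes, adj)
```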
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.