Enhancing Textbooks with Visuals from the Web for Improved Learning
- URL: http://arxiv.org/abs/2304.08931v2
- Date: Fri, 20 Oct 2023 11:06:14 GMT
- Title: Enhancing Textbooks with Visuals from the Web for Improved Learning
- Authors: Janvijay Singh, Vilém Zouhar, Mrinmaya Sachan
- Abstract summary: In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web.
We collect a dataset of e-textbooks in the math, science, social science and business domains.
We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks.
- Score: 50.01434477801967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textbooks are one of the main mediums for delivering high-quality education
to students. In particular, explanatory and illustrative visuals play a key
role in retention, comprehension and general transfer of knowledge. However,
many textbooks lack these interesting visuals to support student learning. In
this paper, we investigate the effectiveness of vision-language models to
automatically enhance textbooks with images from the web. We collect a dataset
of e-textbooks in the math, science, social science and business domains. We
then set up a text-image matching task that involves retrieving and
appropriately assigning web images to textbooks, which we frame as a matching
optimization problem. Through a crowd-sourced evaluation, we verify that (1)
while the original textbook images are rated higher, automatically assigned
ones are not far behind, and (2) the precise formulation of the optimization
problem matters. We release the dataset of textbooks with an associated image
bank to inspire further research in this intersectional area of computer vision
and NLP for education.
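The abstract frames image assignment as a matching optimization problem. Below is a minimal sketch of one such formulation, assuming precomputed text and image embeddings; the Hungarian-algorithm solver and cosine scoring are illustrative choices, not necessarily the authors' exact pipeline.
```python
# Sketch: assign retrieved web images to textbook sections by solving a
# linear assignment problem over text-image similarity scores.
# Embeddings are assumed precomputed (e.g., by a vision-language model);
# the solver and scoring here are illustrative, not the authors' pipeline.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_images(section_embs: np.ndarray, image_embs: np.ndarray):
    """section_embs: (S, d) section text embeddings; image_embs: (I, d)."""
    # Cosine similarity between every section and every candidate image.
    s = section_embs / np.linalg.norm(section_embs, axis=1, keepdims=True)
    i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sim = s @ i.T  # (S, I)
    # The Hungarian algorithm maximizes total similarity under a
    # one-to-one constraint (each image used for at most one section).
    rows, cols = linear_sum_assignment(-sim)  # negate: SciPy minimizes cost
    return list(zip(rows.tolist(), cols.tolist()))
```
Since the abstract notes that the precise formulation of the optimization problem matters, variants such as relaxing the one-to-one constraint or adding a minimum-similarity threshold would change the assignments this sketch produces.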
Related papers
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining [86.76706820098867]
We introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining.
It collects over 2.5 years of instructional videos, totaling 22,000 class hours.
Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment.
arXiv Detail & Related papers (2025-01-01T21:29:37Z)
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data (a sketch of such an encoding setup follows below).
arXiv Detail & Related papers (2024-05-31T15:17:47Z)
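A minimal sketch of the encoding setup described above, assuming Hugging Face checkpoints; the specific models "openai/clip-vit-base-patch32" and "all-MiniLM-L6-v2" (an MTEB-listed model) are illustrative stand-ins, not confirmed by the paper.
```python
# Sketch of the described setup: CLIP for image encoding plus a separate
# MTEB-style text-embedding model. Checkpoints and paths are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # an MTEB-listed model

image = Image.open("page_screenshot.png")  # hypothetical input file
with torch.no_grad():
    pixels = processor(images=image, return_tensors="pt")
    image_emb = clip.get_image_features(**pixels)  # CLIP image-space vector

passage_emb = text_encoder.encode(["Dense text extracted from the page."])
# Note: the two embedding spaces are not aligned; an application would
# typically retrieve with each encoder separately or learn a projection.
```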
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning approach to learning cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder, and sentence-level dense vector representations using an LSTM-based text autoencoder (a simplified sketch follows below).
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
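A compact sketch of that idea: a per-modality autoencoder whose latent codes are pulled into a shared space by an alignment loss. The modules below are simplified stand-ins for the StackGAN-based and LSTM-based models named in the summary, not the paper's architecture.
```python
# Sketch: align the latents of an image autoencoder and an LSTM text
# autoencoder into one cross-modal embedding space. Simplified stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAutoencoder(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):                    # tokens: (B, T) token ids
        x = self.embed(tokens)
        _, (h, _) = self.encoder(x)               # h: (1, B, hidden)
        z = h[-1]                                 # sentence latent (B, hidden)
        dec_in = z.unsqueeze(1).repeat(1, tokens.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), z               # reconstruction logits, latent

def alignment_loss(z_img, z_txt):
    # Pull matched image/text latents together in the shared space;
    # z_img would come from the image autoencoder's bottleneck.
    return 1 - F.cosine_similarity(z_img, z_txt).mean()
```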
- LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning (see the loss sketch below).
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z)
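The auxiliary-task setup can be read as a weighted multi-task objective. A minimal sketch, with loss names and weights as assumptions rather than the paper's exact formulation:
```python
# Sketch of a LAViTeR-style multi-task objective: the main visual-textual
# alignment (VTA) loss assisted by GAN-based image-synthesis and image-
# captioning losses. The weights are illustrative assumptions.
def multitask_loss(vta_loss, gan_loss, caption_loss, w_gan=0.1, w_cap=0.5):
    return vta_loss + w_gan * gan_loss + w_cap * caption_loss
```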
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach for PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
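To make the graph-reasoning step of the last entry concrete, here is a generic sketch of a GCN layer over nodes that mix salient-object and scene-text features. It is not the paper's exact architecture, and all shapes and the toy graph are illustrative.
```python
# Minimal GCN layer over a multi-modal graph: nodes carry either visual
# (salient-object) or textual (scene-text) features projected into one
# space. Generic sketch, not the paper's exact architecture.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) adjacency with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = adj @ x / deg                  # mean aggregation over neighbors
        return torch.relu(self.linear(h))  # relationship-enhanced features

# Usage: project visual and OCR-text features to a common dimension, stack
# them as nodes, and connect co-occurring objects and words in `adj`.
visual = torch.randn(5, 256)   # 5 salient objects (hypothetical)
textual = torch.randn(3, 256)  # 3 scene-text tokens (hypothetical)
nodes = torch.cat([visual, textual], dim=0)
adj = torch.ones(8, 8)         # fully connected toy graph
enhanced = GCNLayer(256)(nodes, adj)
```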
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.