Enhancing Textbooks with Visuals from the Web for Improved Learning
- URL: http://arxiv.org/abs/2304.08931v2
- Date: Fri, 20 Oct 2023 11:06:14 GMT
- Title: Enhancing Textbooks with Visuals from the Web for Improved Learning
- Authors: Janvijay Singh, Vil\'em Zouhar, Mrinmaya Sachan
- Abstract summary: In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web.
We collect a dataset of e-textbooks in the math, science, social science and business domains.
We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks.
- Score: 50.01434477801967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textbooks are one of the main mediums for delivering high-quality education
to students. In particular, explanatory and illustrative visuals play a key
role in retention, comprehension and general transfer of knowledge. However,
many textbooks lack these interesting visuals to support student learning. In
this paper, we investigate the effectiveness of vision-language models to
automatically enhance textbooks with images from the web. We collect a dataset
of e-textbooks in the math, science, social science and business domains. We
then set up a text-image matching task that involves retrieving and
appropriately assigning web images to textbooks, which we frame as a matching
optimization problem. Through a crowd-sourced evaluation, we verify that (1)
while the original textbook images are rated higher, automatically assigned
ones are not far behind, and (2) the precise formulation of the optimization
problem matters. We release the dataset of textbooks with an associated image
bank to inspire further research in this intersectional area of computer vision
and NLP for education.
Related papers
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to increase and also enhance the advance vision models' capabilities in understanding complex visual textual data interconnected data.
arXiv Detail & Related papers (2024-05-31T15:17:47Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using StackGAN-based autoencoder model and also dense vector representations on sentence-level utilizing LSTM based text-autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z) - LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA) will be assisted by two auxiliary tasks, GAN-based image synthesis and Image Captioning.
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z) - Enhancing Social Relation Inference with Concise Interaction Graph and
Discriminative Scene Representation [56.25878966006678]
We propose an approach of textbfPRactical textbfInference in textbfSocial rtextbfElation (PRISE)
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves 6.8$%$ improvement for domain classification in PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z) - Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.