Reading Recognition in the Wild
- URL: http://arxiv.org/abs/2505.24848v2
- Date: Thu, 05 Jun 2025 09:53:39 GMT
- Title: Reading Recognition in the Wild
- Authors: Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim
- Abstract summary: We introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset.
- Score: 20.787452286379292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.
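The abstract describes a flexible transformer that accepts egocentric RGB, eye gaze, and head pose, either individually or combined. The sketch below illustrates one way such a modality-flexible classifier could be structured; the PyTorch module, projection dimensions, class-token fusion, and all names are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a modality-flexible reading classifier in PyTorch.
# Encoder choices, dimensions, and fusion strategy are assumptions for
# illustration only; they are not the authors' architecture.
import torch
import torch.nn as nn


class ReadingRecognizer(nn.Module):
    """Classifies 'reading' vs. 'not reading' from any subset of three
    modalities: egocentric RGB features, eye gaze, and head pose."""

    def __init__(self, d_model=128, rgb_dim=512, gaze_dim=2, pose_dim=6):
        super().__init__()
        # Per-modality linear projections into a shared token space.
        self.proj = nn.ModuleDict({
            "rgb": nn.Linear(rgb_dim, d_model),    # e.g. per-frame features from a frozen backbone (assumed)
            "gaze": nn.Linear(gaze_dim, d_model),  # 2-D gaze direction per timestep
            "pose": nn.Linear(pose_dim, d_model),  # 6-DoF head pose per timestep
        })
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, 2)  # logits for reading / not reading

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs maps modality name -> tensor of shape (batch, time, dim);
        # missing modalities are simply omitted, which gives the
        # "individually or combined" flexibility described in the abstract.
        tokens = [self.proj[name](x) for name, x in inputs.items()]
        seq = torch.cat(tokens, dim=1)                    # (B, T_total, d_model)
        cls = self.cls_token.expand(seq.size(0), -1, -1)  # prepend a class token
        out = self.encoder(torch.cat([cls, seq], dim=1))
        return self.head(out[:, 0])                       # classify from the class token


# Example: gaze-only inference on a 30-timestep window.
model = ReadingRecognizer()
logits = model({"gaze": torch.randn(4, 30, 2)})
print(logits.shape)  # torch.Size([4, 2])
```

Concatenating per-modality token sequences before a shared encoder is only one plausible fusion choice; the paper itself investigates how to efficiently and effectively encode each modality.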
Related papers
- Decoding Reading Goals from Eye Movements [1.3176926720381554]
We examine whether it is possible to distinguish between two types of common reading goals: information seeking and ordinary reading for comprehension. Using large-scale eye tracking data, we address this task with a wide range of models that cover different architectural and data representation strategies. We find that accurate predictions can be made in real time, long before the participant finished reading the text.
arXiv Detail & Related papers (2024-10-28T06:40:03Z)
- Fine-Grained Prediction of Reading Comprehension from Eye Movements [1.2062053320259833]
We focus on a fine-grained task of predicting reading comprehension from eye movements at the level of a single question over a passage.
We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature.
The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension.
arXiv Detail & Related papers (2024-10-06T13:55:06Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model [2.2469442203227863]
We propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM).
The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods.
The LLM processes the data and allows the user to interact with the text and responds to a given query, thus extending the functionality of corrective lenses.
arXiv Detail & Related papers (2024-04-14T13:39:02Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods [7.1345443932276424]
This thesis focuses on representation learning for sequence data over time or space.
It aims to improve downstream sequence prediction tasks by using the learned representations.
arXiv Detail & Related papers (2023-07-25T20:38:55Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
- Hone as You Read: A Practical Type of Interactive Summarization [6.662800021628275]
We present HARE, a new task where reader feedback is used to optimize document summaries for personal interest.
This task is related to interactive summarization, where personalized summaries are produced following a long feedback stage.
We propose to gather minimally-invasive feedback during the reading process to adapt to user interests and augment the document in real-time.
arXiv Detail & Related papers (2021-05-06T19:36:40Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information listed above and is not responsible for any consequences arising from its use.