Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
- URL: http://arxiv.org/abs/2505.02872v1
- Date: Sun, 04 May 2025 13:23:48 GMT
- Title: Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
- Authors: Cfir Avraham Hadar, Omer Shubi, Yoav Meiri, Yevgeni Berzak
- Abstract summary: We ask, for the first time, whether open-ended reading goals can be automatically decoded from eye movements in reading. We develop and compare several discriminative and generative multimodal LLMs that combine eye movements and text for goal classification and goal reconstruction. Our experiments show considerable success on both tasks, suggesting that LLMs can extract valuable information about the readers' text-specific goals from eye movements.
- Score: 1.2062053320259833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you only care about the question ``but does it work?''. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded from eye movements in reading. To address this question, we introduce goal classification and goal reconstruction tasks and evaluation frameworks, and use large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal LLMs that combine eye movements and text for goal classification and goal reconstruction. Our experiments show considerable success on both tasks, suggesting that LLMs can extract valuable information about the readers' text-specific goals from eye movements.
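For concreteness, the sketch below shows one way a discriminative multimodal model of this kind could fuse word-level eye-movement features with text for goal classification. The encoder choice ("roberta-base"), the gaze feature set, the fusion-by-addition scheme, and all names and dimensions are illustrative assumptions, not the models developed in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class GoalClassifier(nn.Module):
    """Hypothetical discriminative goal classifier: fuses contextual token
    embeddings with per-token gaze features (e.g., total fixation duration,
    fixation count, regression rate) and predicts one of n_goals reading goals."""

    def __init__(self, n_goals: int, n_gaze_features: int = 3,
                 encoder_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Project gaze features into the encoder's hidden space and fuse by addition.
        self.gaze_proj = nn.Linear(n_gaze_features, hidden)
        self.classifier = nn.Linear(hidden, n_goals)

    def forward(self, input_ids, attention_mask, gaze_features):
        # gaze_features: (batch, seq_len, n_gaze_features), aligned to tokens;
        # tokens that received no fixations can simply carry zeros.
        text_states = self.encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        fused = text_states + self.gaze_proj(gaze_features)
        # Mean-pool over non-padding tokens, then classify the reading goal.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (fused * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # (batch, n_goals) logits
```

A generative variant could, by analogy, condition a text decoder on the same fused representation to reconstruct the goal as free text; the paper's actual architectures are not specified in this summary.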
Related papers
- Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker [9.284765805642326]
Reading comprehension is a fundamental skill in human cognitive development. There is a growing need to compare how humans and Large Language Models (LLMs) understand language across different contexts.
arXiv Detail & Related papers (2025-07-16T07:15:59Z)
- Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities. We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z)
- Reading Recognition in the Wild [20.787452286379292]
We introduce a new task of reading recognition to determine when the user is reading. We also introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset.
arXiv Detail & Related papers (2025-05-30T17:46:13Z)
- GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
arXiv Detail & Related papers (2024-10-31T08:30:55Z)
- Decoding Reading Goals from Eye Movements [1.3176926720381554]
We examine whether it is possible to distinguish between two types of common reading goals: information seeking and ordinary reading for comprehension. Using large-scale eye tracking data, we address this task with a wide range of models that cover different architectural and data representation strategies. We find that accurate predictions can be made in real time, long before the participant has finished reading the text.
arXiv Detail & Related papers (2024-10-28T06:40:03Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Fine-Grained Prediction of Reading Comprehension from Eye Movements [1.2062053320259833]
We focus on a fine-grained task of predicting reading comprehension from eye movements at the level of a single question over a passage.
We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature.
The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension.
arXiv Detail & Related papers (2024-10-06T13:55:06Z)
- CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP [12.73827827842155]
We propose a novel image-conditioned prompt learning strategy called the Visual Attention conditioned Prompts Learning Network (APPLeNet).
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform the relevant literature and code is available at https://github.com/mainaksingha01/APPLeNet.
arXiv Detail & Related papers (2023-04-12T17:20:37Z)
- Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.