Related papers: Decoding Reading Goals from Eye Movements

Related papers

Déjà Vu? Decoding Repeated Reading from Eye Movements [1.1652979442763178]
We ask whether it is possible to automatically determine whether the reader has previously encountered a text based on their eye movement patterns. We introduce two variants of this task and address them with considerable success using both feature-based and neural models. We present an analysis of model performance which on the one hand yields insights on the information used by the models, and on the other hand leverages predictive modeling as an analytic tool for better characterization of the role of memory in repeated reading.
arXiv Detail & Related papers (2025-02-16T09:59:29Z)
Fine-Grained Prediction of Reading Comprehension from Eye Movements [1.2062053320259833]
We focus on a fine-grained task of predicting reading comprehension from eye movements at the level of a single question over a passage. We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature. The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension.
arXiv Detail & Related papers (2024-10-06T13:55:06Z)
Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection. We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts [0.5520145204626482]
Eye movements in reading play a crucial role in psycholinguistic research. The scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research. We propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts.
arXiv Detail & Related papers (2023-10-24T07:52:19Z)
Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
An Analysis of Reader Engagement in Literary Fiction through Eye Tracking and Linguistic Features [11.805980147608178]
We analyzed the significance of various qualities of the text in predicting how engaging a reader is likely to find it. Furthering our understanding of what captivates readers in fiction will help better inform models used in creative narrative generation.
arXiv Detail & Related papers (2023-06-06T22:14:59Z)
CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. This paper shows that prompt selection and prompt fusion are two major factors that have a direct impact on the inference performance of visual context learning. We propose a simple framework prompt-SelF for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z)
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning of these disparate vision-language tasks. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models [6.642042615005632]
Eye-tracking has potential to provide rich behavioral data about human cognition in ecologically valid environments. This paper studies using computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time.
arXiv Detail & Related papers (2022-11-20T12:24:57Z)
Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection. We propose to learn contextualized, joint representations through vision-language pre-training. The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
CLRGaze: Contrastive Learning of Representations for Eye Movement Signals [0.0]
We learn feature vectors of eye movements in a self-supervised manner. We adopt a contrastive learning approach and propose a set of data transformations that encourage a deep neural network to discern salient and granular gaze patterns.
arXiv Detail & Related papers (2020-10-25T06:12:06Z)
COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos. We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration. Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.