Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals
- URL: http://arxiv.org/abs/2505.16057v1
- Date: Wed, 21 May 2025 22:16:59 GMT
- Title: Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals
- Authors: Ayae Ide, Tory Park, Jaron Mink, Tanusree Sharma
- Abstract summary: We conducted interviews with sighted and BLV participants to examine their interaction with AIG content through self-disclosed indicators. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload.
- Score: 4.129013761788427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-Generated (AIG) content has become increasingly widespread due to recent advances in generative models and easy-to-use tools that have significantly lowered the technical barriers to producing highly realistic audio, images, and videos from simple natural language prompts. In response, platforms are adopting provenance measures and recommending that AIG content be self-disclosed and signaled to users. However, these indicators are often missed, especially when they rely solely on visual cues, making them ineffective for users with different sensory abilities. To address this gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 BLV participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG content. Both groups frequently overlooked the menu-aided indicators deployed by platforms and instead interacted with content-based indicators such as titles and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.
Related papers
- Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
- Analyzing Character Representation in Media Content using Multimodal Foundation Model: Effectiveness and Trust [7.985473318714565]
We ask: even if character distributions along demographic dimensions are available, how useful are they to the general public? Our work addresses these questions through a user study, while proposing a new AI-based character representation and visualization tool. The tool builds on the Contrastive Language-Image Pretraining (CLIP) foundation model to analyze visual screen data and quantify character representation across the dimensions of age and gender (see the sketch after this entry).
arXiv Detail & Related papers (2025-06-02T13:46:28Z)
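For readers who want a concrete picture of the CLIP-based analysis described above, the following is a minimal, hedged sketch of zero-shot frame tagging with the Hugging Face CLIP implementation. It is not the authors' released tool: the checkpoint name, the prompt wording, and the age/gender categories are illustrative assumptions.

```python
# Minimal sketch (not the authors' tool): zero-shot tagging of screen frames with
# CLIP to estimate on-screen character representation by gender and age group.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

GENDER_PROMPTS = ["a photo of a man", "a photo of a woman"]  # illustrative labels
AGE_PROMPTS = ["a photo of a child", "a photo of a young adult",
               "a photo of a middle-aged person", "a photo of an elderly person"]

def classify(image: Image.Image, prompts: list[str]) -> str:
    """Return the prompt with the highest CLIP image-text similarity."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(prompts))
    return prompts[logits.argmax(dim=-1).item()]

frame = Image.open("frame_0001.jpg")  # hypothetical frame sampled from screen content
print(classify(frame, GENDER_PROMPTS), "|", classify(frame, AGE_PROMPTS))
```

Aggregating such per-frame tags over an entire title would yield the kind of demographic distribution the abstract refers to.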
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy (a scoring sketch follows this entry).
arXiv Detail & Related papers (2025-04-24T06:16:38Z)
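As a rough illustration of how the sub-50% accuracy figure above would be computed, here is a hedged scoring loop for a multi-image, multiple-choice benchmark. The JSONL field names and the `predict` callable are assumptions, not VCBENCH's actual interface.

```python
import json

def accuracy(predict, problems_path: str) -> float:
    """Score a model on multiple-choice problems; predict(images, question, choices) -> 'A'..'D'."""
    correct = total = 0
    with open(problems_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed keys: "images", "question", "choices", "answer"
            pred = predict(item["images"], item["question"], item["choices"])
            correct += int(pred == item["answer"])
            total += 1
    return correct / total if total else 0.0
```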
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [90.65399476233495]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z)
- GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance [18.467461615621872]
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV). We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T05:43:40Z)
- VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [34.95077625513563]
We introduce VLM2-Bench, a benchmark designed to assess whether vision-language models can Visually Link Matching cues. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap.
arXiv Detail & Related papers (2025-02-17T17:57:50Z)
- Training Strategies for Isolated Sign Language Recognition [72.27323884094953]
This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds (an illustrative augmentation sketch follows this entry).
arXiv Detail & Related papers (2024-12-16T08:37:58Z)
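The exact augmentations are not spelled out in the abstract; the following is a hedged sketch of the kind of pipeline it alludes to, using torchvision for per-frame augmentation and a simple temporal resampling to mimic varying sign speeds. All parameter values are illustrative assumptions.

```python
import random
import torch
from torchvision import transforms

# Per-frame augmentations; horizontal flips are deliberately omitted because
# mirroring can change the meaning of a sign (an assumption about this domain).
frame_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    transforms.ToTensor(),
])

def speed_perturb(frames: list, target_len: int = 32) -> list:
    """Resample a clip at a random playback rate to simulate faster or slower signing."""
    rate = random.uniform(0.7, 1.3)                              # illustrative range
    n = max(2, min(len(frames), int(round(target_len * rate))))  # source frames to cover
    start = random.randint(0, len(frames) - n)
    idx = torch.linspace(start, start + n - 1, steps=target_len).round().long()
    return [frame_augment(frames[int(i)]) for i in idx]
```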
- Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge [24.538839144639653]
Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components. These models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM).
arXiv Detail & Related papers (2024-11-25T18:33:14Z)
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
- On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic Survey [82.49623756124357]
Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen domains by learning generalized knowledge from limited data. This paper thoroughly investigates recent advances in element-wise ZSIR and provides a basis for its future development.
arXiv Detail & Related papers (2024-08-09T05:49:21Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill a semantic-enhanced prompt for visual representation enrichment. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning [0.800062359410795]
We present a novel paradigm for contextualized Emotion Recognition using a Graph Convolutional Network with Reinforcement Learning (conER-GRL). Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information. The system uses Gated Recurrent Units (GRUs) to extract multimodal features from these groups of utterances (a minimal GRU sketch follows this entry).
arXiv Detail & Related papers (2023-10-24T14:31:17Z)
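As a concrete illustration of the GRU-based feature extraction described above, here is a minimal sketch, not the authors' conER-GRL implementation: the feature dimensions and the simple concatenation-based fusion of text, audio, and video features are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UtteranceGroupEncoder(nn.Module):
    """Fuse per-utterance multimodal features and summarize a group with a GRU."""
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(input_size=text_dim + audio_dim + video_dim,
                          hidden_size=hidden_dim, batch_first=True)

    def forward(self, text_feats, audio_feats, video_feats):
        # Each input: (batch, group_len, dim); one timestep per utterance in the group.
        x = torch.cat([text_feats, audio_feats, video_feats], dim=-1)
        _, h_n = self.gru(x)          # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)         # contextual feature for the whole group

# Usage: a batch of 2 conversations, each group containing 4 utterances.
enc = UtteranceGroupEncoder()
group_feat = enc(torch.randn(2, 4, 768), torch.randn(2, 4, 128), torch.randn(2, 4, 512))
print(group_feat.shape)  # torch.Size([2, 256])
```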
- Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation. We propose a novel Visually-Augmented fine-tuning approach. Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales (a hedged fusion sketch follows this entry).
arXiv Detail & Related papers (2022-12-15T16:13:25Z)
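The abstract does not detail the mechanism, so the sketch below only illustrates the general idea of visually-augmented fine-tuning without images: fusing features from CLIP's text encoder (which was trained with visual supervision) into a BERT classifier. The fusion-by-concatenation design, checkpoint names, and the frozen CLIP encoder are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, CLIPTextModel, CLIPTokenizer

class VisuallyAugmentedClassifier(nn.Module):
    """BERT classifier whose pooled output is concatenated with CLIP text features."""
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.clip_text.parameters():
            p.requires_grad = False  # keep the visually-grounded encoder frozen (assumption)
        fused_dim = self.bert.config.hidden_size + self.clip_text.config.hidden_size
        self.classifier = nn.Linear(fused_dim, num_labels)

    def forward(self, bert_inputs, clip_inputs):
        h_text = self.bert(**bert_inputs).pooler_output       # (batch, 768)
        h_vis = self.clip_text(**clip_inputs).pooler_output   # (batch, 512)
        return self.classifier(torch.cat([h_text, h_vis], dim=-1))

# Usage: tokenize the same sentence for both encoders and fine-tune as usual.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
sent = ["the red fox jumps over the lazy dog"]
logits = VisuallyAugmentedClassifier()(
    bert_tok(sent, return_tensors="pt", padding=True),
    clip_tok(sent, return_tensors="pt", padding=True),
)
```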
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.