Related papers: AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code

AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code

URL: http://arxiv.org/abs/2507.19549v1
Date: Thu, 24 Jul 2025 17:59:30 GMT
Title: AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code
Authors: Nadeen Fathallah, Daniel Hernández, Steffen Staab,
Abstract summary: We introduce a novel taxonomy classifying Web accessibility violations into three key categories - Syntactic, Semantic, and Layout.<n>We propose a novel method, AccessGuru, which combines existing accessibility testing tools and Large Language Models (LLMs) to detect violations.<n>Our benchmark quantifies syntactic and layout compliance and judges semantic accuracy through comparative analysis with human expert corrections.
Score: 11.11923891120399
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The vast majority of Web pages fail to comply with established Web accessibility guidelines, excluding a range of users with diverse abilities from interacting with their content. Making Web pages accessible to all users requires dedicated expertise and additional manual efforts from Web page providers. To lower their efforts and promote inclusiveness, we aim to automatically detect and correct Web accessibility violations in HTML code. While previous work has made progress in detecting certain types of accessibility violations, the problem of automatically detecting and correcting accessibility violations remains an open challenge that we address. We introduce a novel taxonomy classifying Web accessibility violations into three key categories - Syntactic, Semantic, and Layout. This taxonomy provides a structured foundation for developing our detection and correction method and redefining evaluation metrics. We propose a novel method, AccessGuru, which combines existing accessibility testing tools and Large Language Models (LLMs) to detect violations and applies taxonomy-driven prompting strategies to correct all three categories. To evaluate these capabilities, we develop a benchmark of real-world Web accessibility violations. Our benchmark quantifies syntactic and layout compliance and judges semantic accuracy through comparative analysis with human expert corrections. Evaluation against our benchmark shows that AccessGuru achieves up to 84% average violation score decrease, significantly outperforming prior methods that achieve at most 50%.

Related papers

Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks.<n>Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data.<n>We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.<n>The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z)
From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study [1.999925939110439]
This study evaluates ChatGPT's ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG)<n>While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance.<n>We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT's performance.
arXiv Detail & Related papers (2025-01-07T06:51:46Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract) [73.57710917145212]
Learning to rank is widely employed in web searches to prioritize pertinent webpages based on input queries. We propose a emphulineGenerative ulineSemi-ulineSupervised ulinePre-trained (GS2P) model to address these challenges. We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine.
arXiv Detail & Related papers (2024-09-25T03:39:14Z)
ACCESS: Prompt Engineering for Automated Web Accessibility Violation Corrections [0.0]
This paper presents a novel approach to correcting accessibility violations on the web by modifying the document object model (DOM) in real time with foundation models. We achieved greater than a 51% reduction in accessibility violation errors after corrections on our novel benchmark: ACCESS.
arXiv Detail & Related papers (2024-01-28T22:49:33Z)
Integrity and Junkiness Failure Handling for Embedding-based Retrieval: A Case Study in Social Network Search [26.705196461992845]
Embedding based retrieval has seen its usage in a variety of search applications like e-commerce, social networking search etc. In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine. We define two main categories of failures introduced by it, integrity and junkiness.
arXiv Detail & Related papers (2023-04-18T20:53:47Z)
Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
Weakly Supervised Anomaly Detection: A Survey [75.26180038443462]
Anomaly detection (AD) is a crucial task in machine learning with various applications. We present the first comprehensive survey of weakly supervised anomaly detection (WSAD) methods. For each setting, we provide formal definitions, key algorithms, and potential future directions.
arXiv Detail & Related papers (2023-02-09T10:27:21Z)
Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems. We derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
Towards Ethics by Design in Online Abusive Content Detection [7.163723138100273]
The research effort has spread out across several closely related sub-areas, such as detection of hate speech, toxicity, cyberbullying, etc. We bring ethical issues to forefront and propose a unified framework as a two-step process. The novel framework is guided by the Ethics by Design principle and is a step towards building more accurate and trusted models.
arXiv Detail & Related papers (2020-10-28T13:10:24Z)
Mining Implicit Relevance Feedback from User Behavior for Web Question Answering [92.45607094299181]
We make the first study to explore the correlation between user behavior and passage relevance. Our approach significantly improves the accuracy of passage ranking without extra human labeled data. In practice, this work has proved effective to substantially reduce the human labeling cost for the QA service in a global commercial search engine.
arXiv Detail & Related papers (2020-06-13T07:02:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.