Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
- URL: http://arxiv.org/abs/2505.21733v1
- Date: Tue, 27 May 2025 20:22:45 GMT
- Title: Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
- Authors: Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Emily Wenger
- Abstract summary: We conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We find that bots are less likely to comply with stricter robots.txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots.txt at all. These findings suggest that relying on robots.txt to prevent unwanted scraping is risky and highlight the need for alternative approaches.
- Score: 4.68008217188575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well-understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots.txt experiments. We find that bots are less likely to comply with stricter robots.txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots.txt at all. These findings suggest that relying on robots.txt files to prevent unwanted scraping is risky and highlight the need for alternative approaches.
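As a concrete illustration of the compliance check the study measures, a well-behaved scraper consults robots.txt before fetching a URL. Python's standard-library parser makes this easy to sketch; the robots.txt contents, bot name, and URLs below are illustrative, not taken from the paper:

```python
from urllib import robotparser

# An illustrative robots.txt that disallows one path for all user agents.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant bot calls can_fetch() before requesting each URL
# and skips any URL the directives disallow.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/index.html"))    # True
```

The paper's finding is precisely that many bots skip this step, or perform it but ignore the answer when directives are strict.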
Related papers
- On the efficacy of old features for the detection of new bots [0.4506099292980221]
We compare the performance of four state-of-the-art feature sets in detecting novel bots using Twitter as a benchmark. The results hint at the possible use of general-purpose classifiers and cheap-to-compute account features for the detection of evolved bots.
arXiv Detail & Related papers (2025-06-24T13:56:09Z) - Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics [55.05920313034645]
We introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
arXiv Detail & Related papers (2025-05-29T16:41:12Z) - The Liabilities of Robots.txt [19.970962071144722]
The robots.txt file, introduced as part of the Robots Exclusion Protocol in 1994, provides webmasters with a mechanism to communicate access permissions to automated bots. While broadly adopted as a community standard, the legal liabilities associated with violating robots.txt remain ambiguous. This paper clarifies the liabilities associated with robots.txt within the contexts of contract, copyright, and tort law.
arXiv Detail & Related papers (2025-03-08T03:16:17Z) - What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics [5.494111035517598]
Bots tend to use linguistic cues that can be easily automated, while humans use cues that require dialogue understanding. These conclusions are based on a large-scale analysis of social media tweets from 200 million users across 7 events.
arXiv Detail & Related papers (2025-01-01T14:45:43Z) - FP-Inconsistent: Detecting Evasive Bots using Browser Fingerprint Inconsistencies [13.105329613926623]
We conduct the first large-scale evaluation of evasive bots to investigate whether and how altering fingerprints helps bots evade detection. We find an average evasion rate of 52.93% against DataDome and 44.56% against BotD. Given that evasive bots seem to have difficulty ensuring consistency in their fingerprint attributes, we propose a data-driven approach to discover rules that detect such inconsistencies.
arXiv Detail & Related papers (2024-06-11T18:26:17Z) - My Brother Helps Me: Node Injection Based Adversarial Attack on Social Bot Detection [69.99192868521564]
Social platforms such as Twitter are under siege from a multitude of fraudulent users.
Due to the structure of social networks, the majority of detection methods are based on graph neural networks (GNNs), which are susceptible to attacks.
We propose a node injection-based adversarial attack method designed to deceive bot detection models.
arXiv Detail & Related papers (2023-10-11T03:09:48Z) - Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations.
arXiv Detail & Related papers (2023-03-02T18:51:38Z) - Fleet-DAgger: Interactive Robot Fleet Learning with Scalable Human Supervision [72.4735163268491]
Commercial and industrial deployments of robot fleets often fall back on remote human teleoperators during execution.
We formalize the Interactive Fleet Learning (IFL) setting, in which multiple robots interactively query and learn from multiple human supervisors.
We propose Fleet-DAgger, a family of IFL algorithms, and compare a novel Fleet-DAgger algorithm to 4 baselines in simulation.
arXiv Detail & Related papers (2022-06-29T01:23:57Z) - REvolveR: Continuous Evolutionary Models for Robot-to-robot Policy Transfer [57.045140028275036]
We consider the problem of transferring a policy across two different robots with significantly different parameters such as kinematics and morphology.
Existing approaches that train a new policy by matching the action or state transition distribution, including imitation learning methods, fail because the optimal action and/or state distributions are mismatched across different robots.
We propose a novel method named $REvolveR$ of using continuous evolutionary models for robotic policy transfer implemented in a physics simulator.
arXiv Detail & Related papers (2022-02-10T18:50:25Z) - CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning [60.348822346249854]
This study presents a framework for building empathetic chatbots that understand users' implied feelings and reply empathetically over multiple dialogue turns.
We call these chatbots CheerBots. CheerBots can be retrieval-based or generative-based and are fine-tuned with deep reinforcement learning.
To respond empathetically, we develop a simulating agent, a Conceptual Human Model, that aids CheerBots during training by considering future changes in the user's emotional state to arouse sympathy.
arXiv Detail & Related papers (2021-10-08T07:44:47Z) - Detection of Novel Social Bots by Ensembles of Specialized Classifiers [60.63582690037839]
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion.
We show that different types of bots are characterized by different behavioral features.
We propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule.
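The maximum rule mentioned above can be sketched as follows; the scores and threshold are hypothetical stand-ins for the paper's trained specialist classifiers:

```python
import numpy as np

# Rows: accounts; columns: bot scores from three hypothetical
# specialist classifiers (one per class of bot).
scores = np.array([
    [0.2, 0.9, 0.1],   # flagged strongly by the second specialist
    [0.3, 0.4, 0.35],  # no specialist is confident
])

# Maximum rule: an account is labeled a bot if ANY specialist
# assigns it a score above the decision threshold.
combined = scores.max(axis=1)
is_bot = combined > 0.5
print(is_bot)  # [ True False]
```

The design intuition is that a novel bot need only resemble one known bot class strongly for some specialist to catch it.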
arXiv Detail & Related papers (2020-06-11T22:59:59Z) - Detecting and Characterizing Bots that Commit Code [16.10540443996897]
We propose a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits.
We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.
arXiv Detail & Related papers (2020-03-02T21:54:07Z)
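A minimal version of the author-name signal used in commit-bot detection might look like this; the regex and examples are illustrative assumptions, not the paper's actual classifier, which also draws on commit messages, modified files, and projects:

```python
import re

# Naive heuristic: flag commit authors whose names contain a
# standalone "bot" token. Real detection combines more signals.
BOT_NAME_RE = re.compile(r"(?:\b|_)bot(?:\b|_)", re.IGNORECASE)

def looks_like_bot(author_name: str) -> bool:
    """Return True if the author name matches a bot-like pattern."""
    return bool(BOT_NAME_RE.search(author_name))

print(looks_like_bot("dependabot[bot]"))  # True
print(looks_like_bot("Alice Smith"))      # False
```

Note that name-only rules miss bots with human-like names, which is why the paper combines several commit-level features.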
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.