SciFi-Benchmark: How Would AI-Powered Robots Behave in Science Fiction Literature?
- URL: http://arxiv.org/abs/2503.10706v1
- Date: Wed, 12 Mar 2025 16:35:51 GMT
- Title: SciFi-Benchmark: How Would AI-Powered Robots Behave in Science Fiction Literature?
- Authors: Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani
- Abstract summary: We generate a benchmark spanning the key moments in 824 pieces of science fiction literature. We use an LLM's recollection of each key moment to generate questions in similar situations. We then measure an approximation of how well models align with human values on a set of human-voted answers.
- Score: 20.51881907653089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use an LLM's recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in SciFi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark, which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers, in addition to a smaller human-labeled evaluation set. Data is available at https://scifi-benchmark.github.io
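The alignment percentages quoted in the abstract can be read as the share of questions on which a model's chosen answer agrees with the human-voted label. The following is only a minimal sketch of that idea, not the authors' evaluation code: the function name `alignment_rate`, the dictionary layout, and the toy data are all hypothetical assumptions made for illustration.

```python
def alignment_rate(model_choices, human_labels):
    """Fraction of questions where the model's chosen answer was voted aligned.

    model_choices: {question_id: answer_id} (hypothetical layout)
    human_labels:  {question_id: {answer_id: True if humans voted it aligned}}
    """
    if not model_choices:
        raise ValueError("no answers to score")
    aligned = sum(
        1
        for question_id, answer_id in model_choices.items()
        if human_labels[question_id].get(answer_id, False)
    )
    return aligned / len(model_choices)


# Toy data, invented purely for illustration (not taken from the dataset).
human_labels = {
    "q1": {"a": True, "b": False},
    "q2": {"a": False, "b": True},
    "q3": {"a": True, "b": False},
    "q4": {"a": False, "b": True},
}
base_model = {"q1": "a", "q2": "a", "q3": "a", "q4": "b"}
with_constitution = {"q1": "a", "q2": "b", "q3": "a", "q4": "b"}

print(alignment_rate(base_model, human_labels))         # 0.75
print(alignment_rate(with_constitution, human_labels))  # 1.0
```

Under this reading, a reported change such as 79.4% to 95.8% would correspond to comparing two such rates on the same human-labeled evaluation set.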
Related papers
- Development of a PPO-Reinforcement Learned Walking Tripedal Soft-Legged Robot using SOFA [0.0]
This paper presents a ready-to-deploy walking, tripedal, soft-legged robot based on PPO-RL.
An 82% success rate in reaching a single goal is a groundbreaking output.
While trailing the platform steps, outperforming discovery has been observed with an accumulative squared error deviation of 19 mm.
arXiv Detail & Related papers (2025-04-12T14:46:51Z)
- Generating Robot Constitutions & Benchmarks for Semantic Safety [22.889717765617394]
We release the ASIMOV Benchmark for evaluating semantic safety of robot brains.
We develop a framework to automatically generate robot constitutions from real-world data.
We propose a novel auto-amending process that is able to introduce nuances in written rules of behavior.
arXiv Detail & Related papers (2025-03-11T17:50:47Z)
- The One RING: a Robotic Indoor Navigation Generalist [58.431772508378344]
RING (Robotic Indoor Navigation Generalist) is an embodiment-agnostic policy.
It is trained solely in simulation with diverse randomly initialized embodiments at scale.
It achieves average success rates of 72.1% across 5 embodiments in simulation and 78.9% across 4 robot platforms in the real world.
arXiv Detail & Related papers (2024-12-18T23:15:41Z)
- Generalizable Humanoid Manipulation with 3D Diffusion Policies [41.23383596258797]
We build a real-world robotic system to address the problem of autonomous manipulation by humanoid robots.
Our system is mainly an integration of 1) a whole-upper-body robotic teleoperation system to acquire human-like robot data, and 2) a 25-DoF humanoid robot platform with a height-adjustable cart and a 3D LiDAR sensor.
We show that using only data collected in one scene and with only onboard computing, a full-sized humanoid robot can autonomously perform skills in diverse real-world scenarios.
arXiv Detail & Related papers (2024-10-14T17:59:00Z)
- Know your limits! Optimize the robot's behavior through self-awareness [11.021217430606042]
Recent human-robot imitation algorithms focus on following a reference human motion with high precision.
We introduce a deep-learning model that anticipates the robot's performance when imitating a given reference.
Our Self-AWare model (SAW) ranks potential robot behaviors based on various criteria, such as fall likelihood, adherence to the reference motion, and smoothness.
arXiv Detail & Related papers (2024-09-16T14:14:58Z)
- GRUtopia: Dream General Robots in a City at Scale [65.08318324604116]
This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots.
GRScenes includes 100k interactive, finely annotated scenes, which can be freely combined into city-scale environments.
GRResidents is a Large Language Model (LLM) driven Non-Player Character (NPC) system that is responsible for social interaction.
arXiv Detail & Related papers (2024-07-15T17:40:46Z)
- HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation [50.616995671367704]
We present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands.
Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies.
arXiv Detail & Related papers (2024-03-15T17:45:44Z)
- Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap [45.6806234490428]
We benchmark current AIs in their abilities to imitate humans in three language tasks and three vision tasks.
Experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges.
Results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Fleet-DAgger: Interactive Robot Fleet Learning with Scalable Human Supervision [72.4735163268491]
Commercial and industrial deployments of robot fleets often fall back on remote human teleoperators during execution.
We formalize the Interactive Fleet Learning (IFL) setting, in which multiple robots interactively query and learn from multiple human supervisors.
We propose Fleet-DAgger, a family of IFL algorithms, and compare a novel Fleet-DAgger algorithm to 4 baselines in simulation.
arXiv Detail & Related papers (2022-06-29T01:23:57Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- Hyperparameters optimization for Deep Learning based emotion prediction for Human Robot Interaction [0.2549905572365809]
We propose a Convolutional Neural Network architecture based on the Inception module.
The model is implemented in real time on the humanoid robot NAO, and its robustness is evaluated.
arXiv Detail & Related papers (2020-01-12T05:25:02Z)