Taming AI Bots: Controllability of Neural States in Large Language
Models
- URL: http://arxiv.org/abs/2305.18449v1
- Date: Mon, 29 May 2023 03:58:33 GMT
- Title: Taming AI Bots: Controllability of Neural States in Large Language
Models
- Authors: Stefano Soatto, Paulo Tabuada, Pratik Chaudhari, Tian Yu Liu
- Abstract summary: We first introduce a formal definition of "meaning" that is amenable to analysis.
We then characterize "meaningful data" on which large language models (LLMs) are ostensibly trained.
We show that, when restricted to the space of meanings, an AI bot is controllable.
- Score: 81.1573516550699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the question of whether an agent can, by suitable choice of
prompts, control an AI bot to any state. To that end, we first introduce a
formal definition of "meaning" that is amenable to analysis. Then, we
characterize "meaningful data" on which large language models (LLMs) are
ostensibly trained, and "well-trained LLMs" through conditions that are
largely met by today's LLMs. While a well-trained LLM constructs an embedding
space of meanings that is Euclidean, meanings themselves do not form a vector
(linear) subspace, but rather a quotient space within. We then characterize the
subset of meanings that can be reached by the state of the LLMs for some input
prompt, and show that a well-trained bot can reach any meaning albeit with
small probability. We then introduce a stronger notion of controllability as
"almost certain reachability", and show that, when restricted to the space
of meanings, an AI bot is controllable. We do so after introducing a functional
characterization of attentive AI bots, and finally derive necessary and
sufficient conditions for controllability. The fact that AI bots are
controllable means that an adversary could steer them towards any state.
However, the sampling process can be designed to counteract adverse actions and
avoid reaching undesirable regions of state space before their boundary is
crossed.
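As an illustration of that closing claim, the sketch below shows one way a guarded sampling loop could screen candidate continuations and veto any that would carry the state into an undesirable region before it is emitted. It is a minimal toy in Python, not the paper's construction: next_token_distribution and is_undesirable are hypothetical stand-ins for an LLM's decoder and a learned detector of undesirable regions of meaning space.

import random

# Hypothetical stand-ins: a real system would use an LLM's next-token
# distribution and a learned detector of undesirable regions of state space.
def next_token_distribution(context):
    vocab = ["safe_a", "safe_b", "risky_c", "<eos>"]
    return {tok: 1.0 / len(vocab) for tok in vocab}  # uniform toy distribution

def is_undesirable(context):
    return "risky_c" in context  # toy boundary test on the trajectory

def guarded_sample(prompt, max_steps=10, seed=0):
    """Sample a continuation, vetoing any candidate token whose resulting
    state would cross into the undesirable region (checked before emission)."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_steps):
        dist = next_token_distribution(context)
        allowed = {t: p for t, p in dist.items()
                   if not is_undesirable(context + [t])}
        if not allowed:  # nothing safe left to emit: stop early
            break
        tokens = list(allowed)
        weights = [allowed[t] for t in tokens]
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        if tok == "<eos>":
            break
        context.append(tok)
    return context

print(guarded_sample(["hello"]))  # the output never contains "risky_c"

The only point of the sketch is that the check is applied to the candidate state, so the undesirable region is never entered in the first place, which mirrors the abstract's claim that adverse steering can be counteracted at the sampling stage.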
Related papers
- Strong and weak alignment of large language models with human values [1.6590638305972631]
Minimizing negative impacts of Artificial Intelligence (AI) systems requires them to be able to align with human values.
We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted.
We propose a new thought experiment that we call "the Chinese room with a word transition dictionary", in extension of John Searle's famous proposal.
arXiv Detail & Related papers (2024-08-05T11:27:51Z)
- Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z)
- Recurrent Neural Language Models as Probabilistic Finite-state Automata [66.23172872811594]
We study what classes of probability distributions RNN LMs can represent.
We show that simple RNNs are equivalent to a subclass of probabilistic finite-state automata.
These results present a first step towards characterizing the classes of distributions RNN LMs can represent.
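To make the object of that equivalence concrete, here is a small, hypothetical example (not taken from the paper) of a probabilistic finite-state automaton and the forward computation of the probability it assigns to a string; an RNN language model whose hidden states collapse to finitely many values can realize this kind of distribution, which is roughly the correspondence the paper formalizes.

import numpy as np

# Hypothetical two-state PFSA over the alphabet {"a", "b"} with an
# end-of-string event. T[sym][i, j] = P(emit sym and move i -> j);
# for each state, the entries over all symbols plus stopping sum to 1.
T = {
    "a": np.array([[0.4, 0.2],
                   [0.1, 0.3]]),
    "b": np.array([[0.1, 0.2],
                   [0.2, 0.2]]),
}
stop = np.array([0.1, 0.2])      # P(stop | state)
initial = np.array([1.0, 0.0])   # start deterministically in state 0

def string_probability(s):
    """Forward algorithm: one matrix product per symbol, then stop."""
    alpha = initial
    for sym in s:
        alpha = alpha @ T[sym]
    return float(alpha @ stop)

print(string_probability("ab"))  # probability the automaton generates "ab" and stops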
arXiv Detail & Related papers (2023-10-08T13:36:05Z)
- Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc [0.0]
Generative AI, the most popular current approach to AI, consists of large language models (LLMs) that are trained to produce outputs that are plausible, but not necessarily correct.
We discuss an alternative approach to AI which could theoretically address many of the limitations associated with current approaches.
arXiv Detail & Related papers (2023-07-31T16:29:28Z)
- Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [85.03486419424647]
KnowNo is a framework for measuring and aligning the uncertainty of large language models.
KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion.
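That statistical guarantee comes from conformal prediction; the generic split-conformal sketch below (a simplified illustration, not KnowNo's implementation) shows the core mechanism: calibrate a score threshold so that prediction sets cover the true option with roughly 1 - alpha probability, and treat a set containing more than one option as a signal to ask for help. All numbers are hypothetical.

import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: from calibration nonconformity scores
    (here 1 - confidence assigned to the true option), return the quantile
    that yields ~(1 - alpha) coverage on exchangeable test points."""
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(option_confidences, qhat):
    """Keep every candidate option whose nonconformity score clears the
    calibrated threshold; more than one survivor means the model should
    defer (e.g. a robot planner asks a human for help)."""
    return [i for i, c in enumerate(option_confidences) if 1.0 - c <= qhat]

cal_scores = np.linspace(0.01, 0.50, 200)        # hypothetical calibration scores
qhat = conformal_threshold(cal_scores, alpha=0.1)
print(prediction_set([0.70, 0.55, 0.05], qhat))  # -> [0, 1]: ambiguous, ask for help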
arXiv Detail & Related papers (2023-07-04T21:25:12Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both a language model and grounded models.
We frame this as a problem similar to probabilistic filtering: decode a sequence that has high probability both under the language model and under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both models.
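That filtering view can be illustrated with a toy greedy decoder that, at each step, picks the action string maximizing the product of a language-model probability and a grounded feasibility score. This is a hedged sketch with made-up numbers, not the Grounded Decoding system itself.

# Toy stand-ins for the two score sources (hypothetical, not the paper's models).
def lm_probs(steps):
    """Language model: probability of each candidate next step given the plan so far."""
    if "place it in the sink" in steps:
        return {"<eos>": 1.0}
    if "pick up the cup" in steps:
        return {"place it in the sink": 0.5, "fly to the moon": 0.2, "<eos>": 0.3}
    return {"pick up the cup": 0.6, "fly to the moon": 0.3, "<eos>": 0.1}

def grounded_score(steps, candidate):
    """Grounded objective in [0, 1], e.g. feasibility of the action in the scene."""
    feasible = {"pick up the cup": 0.9, "place it in the sink": 0.8,
                "fly to the moon": 0.01, "<eos>": 1.0}
    return feasible[candidate]

def grounded_decode(prompt, max_steps=5):
    """Greedy filtering: choose argmax of P_LM(step | plan) * P_grounded(step | scene).
    (A sampling variant would renormalize the products instead of taking the argmax.)"""
    plan = list(prompt)
    for _ in range(max_steps):
        scores = {s: p * grounded_score(plan, s) for s, p in lm_probs(plan).items()}
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        plan.append(best)
    return plan

print(grounded_decode(["task: tidy the kitchen"]))
# -> ['task: tidy the kitchen', 'pick up the cup', 'place it in the sink']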
arXiv Detail & Related papers (2023-03-01T22:58:50Z)
- Provably Sample-Efficient RL with Side Information about Latent Dynamics [12.461789905893026]
We study reinforcement learning in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space.
We present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon.
arXiv Detail & Related papers (2022-05-27T21:07:03Z)
- On Adversarial Examples and Stealth Attacks in Artificial Intelligence Systems [62.997667081978825]
We present a formal framework for assessing and analyzing two classes of malevolent action towards generic Artificial Intelligence (AI) systems.
The first class involves adversarial examples and concerns the introduction of small perturbations of the input data that cause misclassification.
The second class, introduced here for the first time and named stealth attacks, involves small perturbations to the AI system itself.
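To make the first of these two classes concrete, here is a minimal, generic illustration (not the paper's framework): a fast-gradient-sign perturbation, which nudges the input a small step in the direction that increases the classifier's loss. Against a toy logistic-regression model the input gradient is available in closed form, so the whole attack fits in a few lines; all weights and inputs below are made up.

import numpy as np

w = np.array([2.0, -3.0, 1.0])   # hypothetical fixed weights of the victim model
b = 0.5

def predict_prob(x):
    """P(class 1 | x) under a logistic-regression model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm_perturb(x, y, eps=0.3):
    """Fast-gradient-sign-style perturbation: for logistic regression the
    input gradient of the cross-entropy loss is (p - y) * w in closed form."""
    grad_x = (predict_prob(x) - y) * w
    return x + eps * np.sign(grad_x)

x = np.array([0.2, -0.1, 0.1])        # clean input, true label 1
print(round(predict_prob(x), 3))      # ~0.786: correctly classified
x_adv = fgsm_perturb(x, y=1, eps=0.3)
print(round(predict_prob(x_adv), 3))  # ~0.378: a small shift flips the decision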
arXiv Detail & Related papers (2020-04-09T10:56:53Z)
- Self-explaining AI as an alternative to interpretable AI [0.0]
Double descent indicates that deep neural networks operate by smoothly interpolating between data points.
Neural networks trained on complex real world data are inherently hard to interpret and prone to failure if asked to extrapolate.
Self-explaining AIs are capable of providing a human-understandable explanation along with confidence levels for both the decision and explanation.
arXiv Detail & Related papers (2020-02-12T18:50:11Z)