I’m part of an AI Safety reading group here at Carnegie Mellon, and I thought I’d type up my notes and thoughts on some of our readings and share them here. I am also updating this page with other notable safety-related things I’ve read.
Most of what I have written is summarization and synthesis. Assume any figures I use are not my own.
Table of Contents
- Constitutional AI: Harmlessness from AI Feedback
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Constitutional AI: Harmlessness from AI Feedback
Reinforcement learning from human feedback (RLHF) has become an established method of aligning AI models to human values. However, there may be fundamental limitations to RLHF. For example, obtaining quality human feedback in large quantities is difficult, and humans can themselves be misaligned, whether intentionally or unintentionally. Furthermore, as AI models become more capable, and perhaps more intelligent than ourselves, we cannot rely on human supervision alone to evaluate, oversee, and align them. That is, we need to be able to scale supervision as part of addressing the superalignment problem. The authors also wanted to address the tension between helpfulness and harmlessness: a model that always replies with “:)” is very much harmless, but unfortunately completely unhelpful.
Researchers at Anthropic developed a method they call Constitutional AI (CAI), which aligns an AI model to be harmless without any human feedback labels identifying harmful outputs. Instead, the authors wrote a “constitution” of principles in natural language, chosen fairly ad hoc for the purposes of this research.
- Approach: There are two stages to this extreme form of scaled supervision without human feedback.
- (Supervised stage): Begin by prompting a helpful-only AI assistant with prompts that elicit harmful behavior. The model is then asked to critique and revise its own response with respect to principles from the provided constitution. Finally, a pretrained language model is fine-tuned with supervised learning on these final revised responses. The goal here is to reduce the total length of training required in the next RL stage.
- (RL stage): Instead of performing RLHF, the idea is to perform ‘RLAIF.’ Using the fine-tuned supervised model from the first stage, a pair of responses is generated for each harmful prompt. The prompt, along with the pair of responses, becomes a multiple-choice question in which the responses are evaluated against the provided constitution. Then, as in RLHF, a preference model is trained on this comparison data, and the supervised model from the first stage is fine-tuned again with RL against the preference model. (A rough sketch of both stages appears after this list.)
- Results:
- The researchers found that as language models become more capable, their ability to identify harm improves, especially when using chain-of-thought reasoning; repeated cycles of critique and revision improve harmlessness; and self-supervised preference labels improve model behavior.
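To make the pipeline concrete, here is a rough sketch of the two data-generation steps in Python. Everything in it is hypothetical: `generate` stands in for any call to a helpful-only assistant, the principle text is paraphrased, and the prompt templates are much simplified relative to the paper’s.

```python
import random

# Paraphrased stand-in for the constitution; the paper samples one
# principle at a time from its own list.
CONSTITUTION = [
    "Choose the response that is least harmful, unethical, or toxic.",
]

def critique_and_revise(generate, harmful_prompt, n_rounds=2):
    """Supervised stage: sample a response to a harm-eliciting prompt,
    then repeatedly critique and revise it against a random principle.
    The final revision becomes a supervised fine-tuning target."""
    response = generate(f"Human: {harmful_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"{harmful_prompt}\n{response}\n"
            f"Critique the response above according to: {principle}"
        )
        response = generate(
            f"{harmful_prompt}\n{response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response

def ai_preference_label(generate, harmful_prompt, response_a, response_b):
    """RL stage: pose the pair of responses as a multiple-choice question;
    the resulting (prompt, chosen, rejected) data trains the preference model."""
    principle = random.choice(CONSTITUTION)
    answer = generate(
        f"Prompt: {harmful_prompt}\n"
        f"Which response better follows this principle: {principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    return "A" if answer.strip().startswith("A") else "B"
```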
I found it interesting that they would use the same model to critique and revise itself in the SL stage. The authors note that they guided the model into understanding what it was being asked to do using few-shot prompting, as the model would sometimes confuse its perspective.
Something else I found interesting was how their CAI models could be overtrained, resulting in boilerplate phrases like “you are valid, valued, and cared for,” which I thought was a humorous instance of Goodhart’s law. They found that smoothing out labels and ensembling labels generated from the constitutional principles (as they prompted their models with one principle at a time, which I glossed over) reduced this undesired behavior.
They also found that their RLAIF model was “virtually never evasive,” which I was pretty amazed to read. Of course, the constitutional principles chosen and the fine-tuning done in practice are probably quite different from the ones chosen ad hoc for this paper, but I was surprised that harmlessness did not need to come through evasive behavior at the cost of a significant decrease in helpfulness. You would think that an intelligent enough system would be able to provide a nuanced (and helpful) response to any sort of prompt, whether the prompt is harm-eliciting or not, and it seems like this is true.
I think it would be interesting to see whether this non-evasive behavior continues as the constitutional principles become more restrictive and complex (because at some point, Anthropic does cut off responses from its Claude 3 models). The paper discusses how the number of constitutional principles seemed to have little effect on harmlessness and helpfulness, but I would think that the semantic content of these principles matters more than their quantity. I would have to look into this more.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
This paper considers whether and how we can align and supervise superhuman models. Of course, we do not yet have superhuman models, so the paper tests whether weak models can effectively supervise strong models. The authors examine whether a GPT-2-level model trained on ground truth can act as a supervisor for GPT-4, and find that it largely can: fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss can recover close to GPT-3.5-level performance.
- Scalable oversight is a big topic in alignment. The more intelligent and complex models get, the harder it is for humans to reliably evaluate the output of these models. By definition, a superhuman model would generate outputs beyond the capabilities of humans. In the event that we develop superhuman AI, we still want to be able to effectively and efficiently align these models.
- Question: Humans will be weak supervisors compared to superhuman AI. Can we use a weak model to supervise a strong model?
- Intuition: We don’t want a stronger model to simply imitate a weaker model; after all, the stronger model is more capable, more intelligent, and more knowledgeable. When we generate labels using a fine-tuned weak model and use them as fine-tuning data for the stronger model, we hope that the stronger model already contains the capabilities to correctly perform the tasks we care about, so that instead of imitating the incorrect outputs of the weaker, fine-tuned model, it picks up on that model’s aligned intent. The hope is that this produces a strong, aligned model.
- Method: We generate weak labels by taking the outputs of a fine-tuned weak model, then fine-tune a strong model on the data labeled by the weak model. Because we don’t have superhuman AI, we can also fine-tune the strong model on ground truth and use that as a comparison. The authors operationalize the intuition that the student (the stronger model) should learn the intent of the supervisor (the weaker model), but not imitate its incorrect behavior, using an auxiliary confidence loss term (sketched below). When the strong model’s output differs from the weak model’s label, the additional loss term reinforces the strong model’s confidence in its own answer.
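Here is a minimal sketch of what such a loss term could look like for a binary task, assuming the strong model produces `logits` and the weak supervisor provides soft `weak_labels`; the paper’s exact formulation (e.g. how it thresholds and balances the strong model’s own predictions) differs in the details.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(logits, weak_labels, alpha=0.5):
    """Mix the usual 'imitate the weak supervisor' loss with an auxiliary
    term that reinforces the strong model's own (hardened) predictions."""
    # Harden the strong model's predictions; no gradient flows through them.
    hard_self_labels = (torch.sigmoid(logits).detach() > 0.5).float()
    weak_term = F.binary_cross_entropy_with_logits(logits, weak_labels)
    conf_term = F.binary_cross_entropy_with_logits(logits, hard_self_labels)
    return (1 - alpha) * weak_term + alpha * conf_term
```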
Toy Models of Superposition
This paper is part of Anthropic’s mechanistic interpretability research. The field of mech interp focuses on reverse engineering neural networks into something understandable/interpretable.
Definitions and Motivation:
- Linear representation hypothesis: neural networks represent features of the input as directions in activation space. For example, Mikolov et al. famously found that vector(“King”) - vector(“Man”) + vector(“Woman”) results in a vector close to vector(“Queen”).
- So what are these directions? The vector space of a neural network layer’s activations is called the “representation.” The basis dimensions in this activation space are described as privileged or non-privileged. In a representation where basis dimensions are non-privileged, features can be embedded along any direction (i.e. the basis dimensions are not special). In a representation where the basis dimensions are privileged, features tend to align with those basis dimensions. Privileged basis directions are called “neurons.” (Neurons respond to particular patterns of input, much like how privileged basis directions correspond to particular interpretable features of the input.)
- Superposition hypothesis: when neural networks attempt to represent more features than there are neurons, features correspond to almost-orthogonal directions in activation space.
- There exists significant empirical evidence that neurons can be “polysemantic” even when there is a privileged basis. The presence of one feature may cause a slight activation of another feature in activation space, but this interference is the cost of superposition.
- The idea is that a smaller neural network may noisily simulate a larger neural network.
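A quick numerical illustration of the “almost-orthogonal” point (my own, not from the paper): only $m$ directions in $\mathbb{R}^m$ can be exactly orthogonal, but far more can be nearly orthogonal, which is the geometric slack that superposition exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 64, 512                        # pack 512 "features" into 64 dimensions
W = rng.standard_normal((n_features, m))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm feature directions

cos = W @ W.T                                  # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print(f"mean |cos| = {np.abs(cos).mean():.3f}, max |cos| = {np.abs(cos).max():.3f}")
# The mean is roughly sqrt(2/pi)/sqrt(m) ≈ 0.1: a typical pair of features
# interferes only slightly, even though 512 directions can never be
# mutually orthogonal in 64 dimensions.
```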
Experiment:
Now, the researchers set out to demonstrate the superposition hypothesis. Let there be an ideal larger model with activations $x \in \mathbb{R}^n$ that perfectly capture all features. The question is whether $x$ can be projected into a lower-dimensional vector $h \in \mathbb{R}^m$ and then recovered ($x \to x'$), simulating how a non-ideal smaller model might noisily simulate an ideal larger model. This setup can be thought of as an autoencoder, and the goal here is to interpret the behavior of a simple autoencoder.
Consider two models:
- Linear model (no activation): $h = Wx$, $x' = W^T h + b$
- ReLU model: $h = Wx$, $x' = \text{ReLU}(W^T h + b)$
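Here is a rough reimplementation of this toy setup as I read it (not the authors’ code): synthetic feature vectors that are mostly zero, exponentially decaying feature importances, and the ReLU model trained with an importance-weighted reconstruction loss.

```python
import torch

n, m = 20, 5                                     # number of features, hidden dimensions
importance = 0.9 ** torch.arange(n).float()      # feature importance decays with index
feature_prob = 0.05                              # chance a feature is nonzero (high sparsity)

W = torch.nn.Parameter(0.1 * torch.randn(m, n))  # columns W_i are per-feature embeddings
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    x = torch.rand(1024, n)                              # feature values in [0, 1]
    x = x * (torch.rand(1024, n) < feature_prob)         # zero out most features
    h = x @ W.T                                          # h = W x        (project down)
    x_hat = torch.relu(h @ W + b)                        # x' = ReLU(W^T h + b)
    loss = (importance * (x - x_hat) ** 2).mean()        # importance-weighted MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Dropping the `torch.relu` gives the linear model, and sweeping `feature_prob` corresponds to the sparsity axis in the results below.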
Results:
- Here $n = 20, m = 5$. That is, there are 20 features of varying importance but the smaller model only has 5 dimensions to work with.
- The top row of plots is a visualization of $W^TW$, which describes the map $x \to x'$ (before the bias and activation); for a perfect reconstruction it would be the identity.
- The bottom row of plots represents the magnitudes of the column vectors of $W$, i.e., the norm of each feature’s embedding vector. The color of each bar measures how much a given feature shares its dimensions with other features (calculated using a simple dot product). Moving to the right, the sparsity of the features increases.
- The linear model with no activations represents the top 5 most important features with no superposition.
- When there is no sparsity, the ReLU model also learns the top $m$ features.
- As sparsity increases, more features are represented using superposition (the embedding vectors of more features become nonzero). The least important features are the ones placed in superposition, and they are arranged in antipodal pairs (pairs of features embedded in opposite directions). (The snippet below shows one way these per-feature quantities can be computed from $W$.)
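For concreteness, here is one way to compute the per-feature quantities described in the plots (my own reconstruction of the metrics, reusing the `W` from the training sketch above): each feature’s embedding norm, and a dot-product measure of how much it shares its dimensions with other features.

```python
import torch

# W: the (m, n) weight matrix from the training sketch above.
with torch.no_grad():
    norms = W.norm(dim=0)              # ||W_i||: how strongly feature i is represented
    W_hat = W / (norms + 1e-8)         # unit-normalized feature directions
    overlap = (W_hat.T @ W) ** 2       # entry (i, j): squared dot product of Ŵ_i and W_j
    interference = overlap.sum(dim=1) - overlap.diagonal()
    # interference[i] ≈ 0 -> feature i has dimensions mostly to itself;
    # larger values       -> feature i is in superposition with other features.
```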
These are just some of the main points. The paper discusses a lot more than what I have covered.
Bigger picture: To create safe AI, it would be useful to “identify and enumerate over all features.” That is, we want to be able to better understand what types of inputs lead to what types of outputs. Ideally, we would like to know if it is (im)possible to elicit harmful behavior from a model. Better understanding models that do and do not exhibit superposition would be a step in this direction.