
Gemma Scope: Helping the safety community shed light on the inner workings of language models — AI Reading Summary | 包阅AI

包阅AI Reading Summary

1. Keywords: Gemma Scope, language models, sparse autoencoders, mechanistic interpretability, research progress

2. Summary:

– This article introduces Gemma Scope, a suite of tools for language model interpretability.

– It consists of hundreds of sparse autoencoders that help researchers study the inner workings of language models.

– It highlights what makes Gemma Scope unique and its role in advancing mechanistic interpretability research.

3. Main content:

– Gemma Scope

– Helps the safety community understand the inner workings of language models

– Includes hundreds of sparse autoencoders for Gemma 2

– Inner workings of language models

– Models learn from vast amounts of data, so their inner workings are often a mystery

– Mechanistic interpretability research works to decipher them

– The role of sparse autoencoders

– Decompose activations into a small number of features

– Discover the underlying features the model actually uses

– What makes Gemma Scope unique

– Sparse autoencoders trained on every layer of Gemma 2

– Uses the new JumpReLU SAE architecture

– Pushing the field forward

– Hopes to accelerate mechanistic interpretability research

– Looks forward to the community applying these techniques to modern models and solving real-world problems


Article URL: https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/

Source: deepmind.google

Author: Google DeepMind Blog

Published: 2024/7/31 15:59

Language: English

Total word count: 996

Estimated reading time: 4 minutes

Score: 91

Tags: language model interpretability, sparse autoencoders, AI safety, AI transparency, Gemma 2


The original article content follows



Gemma Scope: helping the safety community shed light on the inner workings of language models

Authors

Language Model Interpretability team

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of ‘microscope’ that lets them see inside a language model, and get a better sense of how it works.

Today, we’re announcing Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We’re also open sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.

We hope today’s release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents like deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens inside a language model

When you ask a language model a question, it turns your text input into a series of ‘activations’. These activations map the relationships between the words you’ve entered, helping the model make connections between different words, which it uses to write an answer.

As the model processes text input, activations at different layers in the model’s neural network represent multiple increasingly advanced concepts, known as ‘features’.

For example, a model’s early layers might learn to recall facts, such as the fact that Michael Jordan plays basketball, while later layers may recognize more complex concepts, such as the factuality of the text.
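As a rough illustration of where these activations come from (not part of the original post), the sketch below pulls per-layer hidden states out of an open model with the Hugging Face transformers library. The model identifier is an assumption for illustration; any open causal language model exposes its activations in the same way.

```python
# Minimal sketch (not from the article): extracting per-layer activations
# from an open model with Hugging Face transformers. The model id is an
# assumption; Gemma 2 weights may require accepting a license and a recent
# transformers version, and any small open causal LM can be substituted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The City of Light is Paris.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple: (embedding output, layer 1, ..., layer N).
# Each entry has shape [batch, sequence_length, hidden_size]; these per-layer
# vectors are the 'activations' that sparse autoencoders are trained to decompose.
for layer_idx, acts in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(acts.shape))
```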

A stylised representation of using a sparse autoencoder to interpret a model’s activations as it recalls the fact that the City of Light is Paris. We see that French-related concepts are present, while unrelated ones are not.

However, interpretability researchers face a key problem: the model’s activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that features in a neural network’s activations would line up with individual neurons, i.e., nodes of information. But unfortunately, in practice, neurons are active for many unrelated features. This means that there is no obvious way to tell which features are part of the activation.

This is where sparse autoencoders come in.

A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them – i.e., the model uses features sparsely. For example, a language model will consider relativity when responding to an inquiry about Einstein and consider eggs when writing about omelettes, but probably won’t consider relativity when writing about omelettes.

Sparse autoencoders leverage this fact to discover a set of possible features, and break down each activation into a small number of them. Researchers hope that the best way for the sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
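To make the idea concrete, here is a minimal, illustrative sketch of the encode/decode step a sparse autoencoder performs on a single activation vector. The weights are random placeholders rather than trained Gemma Scope parameters, and the sizes are made up; in a trained SAE, the sparsity penalty applied during training is what drives the number of firing features down to a handful.

```python
# Toy sketch of the sparse autoencoder idea (random placeholder weights, not
# Gemma Scope). An activation of size d_model is encoded into d_sae candidate
# features, then decoded back to approximate the original activation.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 256, 4096          # hypothetical sizes; real SAEs are much wider

W_enc = rng.normal(0, 0.02, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.02, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(activation):
    # ReLU keeps only positively-firing features. With random weights roughly
    # half of them fire; a trained SAE's sparsity penalty pushes this down to
    # a small number of meaningful features per activation.
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def decode(features):
    # Reconstruct the activation as a weighted sum of feature directions.
    return features @ W_dec + b_dec

activation = rng.normal(size=d_model)        # stand-in for a model activation
features = encode(activation)
reconstruction = decode(features)

print("active features:", int((features > 0).sum()), "of", d_sae)
print("reconstruction error:", float(np.linalg.norm(activation - reconstruction)))
```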

Importantly, at no point in this process do we – the researchers – tell the sparse autoencoder which features to look for. As a result, we are able to discover rich structures that we did not predict. However, because we don’t immediately know the meaning of the discovered features, we look for meaningful patterns in examples of text where the sparse autoencoder says the feature ‘fires’.

Here’s an example in which the tokens where the feature fires are highlighted in gradients of blue according to their strength:

Example activations for a feature found by our sparse autoencoders. Each bubble is a token (word or word fragment), and the variable blue color illustrates how strongly the feature is present. In this case, the feature is apparently related to idioms.
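In code, inspecting a feature often amounts to ranking tokens by how strongly that feature fires on them. The snippet below is a hypothetical sketch of that step: `token_strings` and `feature_acts` stand in for values you would obtain by running text through the model and the sparse autoencoder.

```python
# Hypothetical sketch: list the tokens on which one feature fires most strongly,
# similar in spirit to the highlighted example above. The data here is made up.
import numpy as np

token_strings = ["break", "the", "ice", "with", "a", "joke"]   # example tokens
feature_acts = np.array([3.1, 0.0, 4.7, 0.2, 0.0, 0.0])        # one feature's strength per token

top = np.argsort(feature_acts)[::-1][:3]     # indices of the strongest firings
for idx in top:
    if feature_acts[idx] > 0:
        print(f"{token_strings[idx]!r}: {feature_acts[idx]:.2f}")
```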

What makes Gemma Scope unique

Prior research with sparse autoencoders has mainly focused on investigating the inner workings of tiny models or a single layer in larger models. But more ambitious interpretability research involves decoding layered, complex algorithms in larger models.

We trained sparse autoencoders at every layer and sublayer output of Gemma 2 2B and 9B to build Gemma Scope, producing more than 400 sparse autoencoders with more than 30 million learned features in total (though many features likely overlap). This tool will enable researchers to study how features evolve throughout the model and interact and compose to make more complex features.
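The released SAE weights can be fetched from the Hugging Face Hub. The sketch below shows one way to download and inspect a single SAE; the repository id, file path, and parameter names are assumptions for illustration and may not match the actual release layout, so check the official Gemma Scope release for details.

```python
# Hedged sketch: fetch one Gemma Scope SAE and inspect its parameters.
# The repo id, file path, and parameter names below are assumptions.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # assumed repo id
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed path
)
params = np.load(path)
for name in params.files:   # e.g. W_enc, W_dec, b_enc, b_dec, threshold (assumed names)
    print(name, params[name].shape)
```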

Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present, and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
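The article does not spell the architecture out, but as a sketch based on the JumpReLU SAE paper: each feature has a learned threshold, and its pre-activation either passes through at full strength or is zeroed, which decouples detecting whether a feature is present from estimating how strong it is.

```python
# Sketch of the JumpReLU activation (based on the JumpReLU SAE paper, not code
# from the article): a feature fires only if its pre-activation clears a
# learned per-feature threshold, and then passes through unchanged.
import numpy as np

def jump_relu(pre_acts: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Zero out features whose pre-activation does not exceed their threshold."""
    return np.where(pre_acts > theta, pre_acts, 0.0)

pre_acts = np.array([-0.5, 0.3, 1.2, 2.0])
theta = np.array([0.1, 0.8, 0.8, 0.8])   # per-feature thresholds (learned in practice)
print(jump_relu(pre_acts, theta))        # -> [0., 0., 1.2, 2.0]
```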

Training so many sparse autoencoders was a significant engineering challenge, requiring a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding compute for generating distillation labels), saved about 20 Pebibytes (PiB) of activations to disk (about as much as a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
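As a quick sanity check on the Wikipedia comparison (my arithmetic, not the article's): dividing 20 PiB across a million copies leaves roughly 20 GiB per copy, which is in the ballpark of a compressed English Wikipedia text dump.

```python
# Back-of-the-envelope check of the comparison above (illustrative only).
total_bytes = 20 * 2**50          # 20 pebibytes
copies = 1_000_000
per_copy_gib = total_bytes / copies / 2**30
print(f"{per_copy_gib:.1f} GiB per copy")   # ~21.0 GiB
```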

Pushing the field forward

In releasing Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and to accelerate the community’s work in this field.

So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and developing relevant techniques, like causal interventions, automatic circuit analysis, feature interpretation, and evaluating sparse autoencoders. With Gemma Scope, we hope to see the community scale these techniques to modern models, analyze more complex capabilities like chain-of-thought, and find real-world applications of interpretability such as tackling problems like hallucinations and jailbreaks that only arise with larger models.