Broadcasts.com - "19 - Mechanistic Interpretability with Neel Nanda" (AXRP

Technology

19 - Mechanistic Interpretability with Neel Nanda

Published: Feb. 4, 2023, 2:56 a.m.

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

Topics we discuss, and timestamps:

- 00:01:05 - What is mechanistic interpretability?
- 00:24:16 - Types of AI cognition
- 00:54:27 - Automating mechanistic interpretability
- 01:11:57 - Summarizing the papers
- 01:24:43 - 'A Mathematical Framework for Transformer Circuits'
  - 01:39:31 - How attention works
  - 01:49:26 - Composing attention heads
  - 01:59:42 - Induction heads
- 02:11:05 - 'In-context Learning and Induction Heads'
  - 02:12:55 - The multiplicity of induction heads
  - 02:30:10 - Lines of evidence
  - 02:38:47 - Evolution in loss-space
  - 02:46:19 - Mysteries of in-context learning
- 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
  - 02:50:57 - How neural nets learn modular addition
  - 03:11:37 - The suddenness of grokking
- 03:34:16 - Relation to other research
- 03:43:57 - Could mechanistic interpretability possibly work?
- 03:49:28 - Following Neel's research

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

Links to Neel's things:

- Neel on Twitter: twitter.com/NeelNanda5
- Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1
- Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability
- TransformerLens: github.com/neelnanda-io/TransformerLens
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
- Neel on YouTube: youtube.com/@neelnanda2469
- 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj
- Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

Writings we discuss:

- A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html
- In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052
- interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
- Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262
- Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097
- Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN
- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544
- Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration
- Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913
- Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635
- Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a