How to use and interpret activation patching

Heimersheim, Stefan; Nanda, Neel

(Submitted on 23 Apr 2024)

Abstract: Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.

Comments:	A tutorial on activation patching. 13 pages, 2 figures
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2404.15255 [cs.LG]
	(or arXiv:2404.15255v1 [cs.LG] for this version)

Submission history

From: Stefan Heimersheim
[v1] Tue, 23 Apr 2024 17:42:29 GMT (290kb,D)
