Title: How to use and interpret activation patching

Abstract: Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
Comments: A tutorial on activation patching. 13 pages, 2 figures
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2404.15255 [cs.LG]
  (or arXiv:2404.15255v1 [cs.LG] for this version)

Submission history

From: Stefan Heimersheim
[v1] Tue, 23 Apr 2024 17:42:29 GMT
