Computer Science > Machine Learning
Title: How to use and interpret activation patching
(Submitted on 23 Apr 2024)
Abstract: Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
Submission history
From: Stefan Heimersheim
[v1] Tue, 23 Apr 2024 17:42:29 GMT (290kb,D)