Fascinating paper.
The move is from sparse autoencoders (SAEs) to natural language autoencoders. These generate natural language descriptions of a model's "internal state" at each token, like reading its mind. The loss function is how faithfully the activations can be reconstructed from those descriptions: much like an SAE, except the compressed representation is natural language.
Anthropic has shown how to generate these descriptions for frontier models, yielding insights into confabulation, reward hacking, and more. Amazing interpretability work.
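
To make the reconstruction objective above concrete, here is a toy sketch in Python. It is not the paper's method: the four-word vocabulary, the 4-dimensional activation space, and the functions `describe`, `reconstruct`, and `reconstruction_loss` are all hypothetical, and in a real system both the describing and the reconstructing step would presumably be learned models rather than hand-written lookups. The only point is the shape of the loss: activation in, text in the middle, activation out, and an error measured between the two.

```python
# Hypothetical toy vocabulary: each word is tied to one direction in a
# 4-dimensional "activation space". None of this comes from the paper.
WORD_DIRECTIONS = {
    "code": [1.0, 0.0, 0.0, 0.0],
    "math": [0.0, 1.0, 0.0, 0.0],
    "deception": [0.0, 0.0, 1.0, 0.0],
    "uncertainty": [0.0, 0.0, 0.0, 1.0],
}
SEPARATOR = " and "


def describe(activation, k=2):
    """Toy 'encoder': name the k directions most present in the activation."""
    scores = {
        word: sum(a * d for a, d in zip(activation, direction))
        for word, direction in WORD_DIRECTIONS.items()
    }
    top_words = sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:k]
    return SEPARATOR.join(top_words)


def reconstruct(description):
    """Toy 'decoder': rebuild an activation using only the text description."""
    recon = [0.0] * 4
    for word in description.split(SEPARATOR):
        for i, d in enumerate(WORD_DIRECTIONS[word]):
            recon[i] += d
    return recon


def reconstruction_loss(activation, recon):
    """Mean squared error between the original and rebuilt activation."""
    return sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)


if __name__ == "__main__":
    activation = [1.0, 0.9, 0.1, 0.0]
    for k in (1, 2):
        text = describe(activation, k=k)
        loss = reconstruction_loss(activation, reconstruct(text))
        print(f"k={k}: description={text!r}, loss={loss:.4f}")
```

In this toy, a two-word description reconstructs the activation with lower error than a one-word description, which is the pressure the loss is meant to apply: descriptions are rewarded for carrying whatever information is needed to rebuild the internal state.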
From X