CoinNX | Haseeb ＞|＜

@hosseeb

Managing partner @dragonfly_xyz. Let's think step by step.

Fascinating paper. Sparse autoencoders => natural language autoencoders. These generate natural language descriptions of the "internal state" of a model at each token, like reading its mind (loss function: ability to use those descriptions to faithfully reconstruct the activations, kind of like an SAE, but the compressed representation is natural language). Anthropic has shown how to generate these descriptions for frontier models, capturing great insights on confabulation, reward hacking, etc. Amazing interpretability work.

Anthropic @AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text. https://t.co/pMLsxM2VAO

From X

