#interpretability
3 posts · all tags
-
· H-Neurons: Hallucination at the Neuron Level
Gao et al. (https://arxiv.org/abs/2512.01797) present a compelling investigation into the microscopic mechanisms of hallucination in LLMs. The central thesis is that hallucinations are not diffuse phenomena spread across millions of parameters, but are instead driven by a remarkably sparse subset of neurons -- fewer than 0.1% of the total -- which the authors term H-Neurons (Hallucination-associated Neurons).
-
· Mechanistic Interpretability - Some concepts
Here are some quick notes on concepts in Mechanistic Interpretability. The field is vast and very recent; it tries to interpret the features learned by neural networks, specifically transformers and LLMs.
-
· Sparse Autoencoders
Sparse autoencoders are neural networks that learn compressed representations of data while enforcing sparsity, a constraint that ensures most neurons remain inactive for any given input. This approach tends to yield more robust and interpretable features, which often capture meaningful patterns in the data.