#interpretability

3 posts

Jan 27, 2026 · H-Neurons: Hallucination at the Neuron Level

Gao et al. (https://arxiv.org/abs/2512.01797) present a compelling investigation into the microscopic mechanisms of hallucination in LLMs. The central thesis is that hallucinations are not diffuse phenomena spread across millions of parameters, but are instead driven by a remarkably sparse subset of neurons -- fewer than 0.1% of the total -- which the authors term H-Neurons (Hallucination-associated Neurons).

Apr 12, 2025 · Mechanistic Interpretability - Some concepts

Here are some quick notes on concepts in Mechanistic Interpretability. The field is vast and very recent; it tries to interpret the features learned by neural networks, specifically transformers and LLMs.

Nov 10, 2024 · Sparse Autoencoders

Sparse autoencoders are neural networks that learn compressed representations of data while enforcing sparsity - a constraint that ensures most neurons remain inactive for any given input. This approach leads to more robust and interpretable features, often capturing meaningful patterns in the data.
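
To make the sparsity constraint concrete, here is a minimal sketch in NumPy. It is one common formulation, not necessarily the one used in the post: a single ReLU hidden layer, trained by plain gradient descent on reconstruction error plus an L1 penalty on the hidden activations. The function name `train_sae` and all hyperparameters (`n_hidden`, `l1_coeff`, `lr`, `steps`) are illustrative assumptions.

```python
import numpy as np


def train_sae(x, n_hidden=16, l1_coeff=1e-3, lr=0.01, steps=500, seed=0):
    """Train a tiny sparse autoencoder with hand-written gradients.

    Loss = mean squared reconstruction error + l1_coeff * mean L1 norm of
    the hidden code. All names and defaults here are illustrative.
    """
    rng = np.random.default_rng(seed)
    batch, n_in = x.shape

    # Small random weights, zero biases.
    W_enc = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in))
    b_dec = np.zeros(n_in)

    losses = []
    for _ in range(steps):
        # Forward pass: encode, apply ReLU, decode.
        pre = x @ W_enc + b_enc
        h = np.maximum(pre, 0.0)
        x_hat = h @ W_dec + b_dec

        recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
        sparsity = np.mean(np.sum(h, axis=1))  # h >= 0, so |h| == h
        losses.append(recon + l1_coeff * sparsity)

        # Backward pass (all gradients computed before any update).
        d_xhat = 2.0 * (x_hat - x) / batch
        d_W_dec = h.T @ d_xhat
        d_b_dec = d_xhat.sum(axis=0)
        d_h = d_xhat @ W_dec.T + l1_coeff / batch
        d_pre = d_h * (pre > 0)  # ReLU gate; also zeroes the L1 term where h == 0
        d_W_enc = x.T @ d_pre
        d_b_enc = d_pre.sum(axis=0)

        # Gradient descent step.
        W_enc -= lr * d_W_enc
        b_enc -= lr * d_b_enc
        W_dec -= lr * d_W_dec
        b_dec -= lr * d_b_dec

    return losses, (W_enc, b_enc, W_dec, b_dec)
```

Calling `train_sae` on a `(batch, n_in)` array returns the per-step loss history and the learned parameters. Raising `l1_coeff` trades reconstruction quality for fewer active hidden units, which is the knob that "enforcing sparsity" refers to in this formulation.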