6/25/2025
i guess the post frequency is gradually becoming once a week T_T anyways, i am EXPECTED to post smth on substack soon so 👀 lol what are these blogger emojis
here is the most recent quanta article i'm reading: a new pyramid shape that always lands the same side up. there's a neat application of this thing: space exploration! and it's interesting how ppl can just visualize these things, as someone who sucks at geometry T_T, esp in higher dimensions. conway was brilliant.
i'll just paste some literature review i did on some neural network stuff; some of it is copy-pasted from the paper abstracts (sorry!)
Activation Anomaly Analysis (Mar 2020)
a novel, semi-supervised, purely data-driven approach to anomaly detection based on the hidden activation patterns of NNs, with an emphasis on transferability of the algorithm
comprised of two parts:
a target network unrelated to the anomaly detection task
an alarm network analyzing the target’s activations
Experiments give high F1, precision, and recall scores
Datasets used: MNIST, EMNIST, CSE-CIC-IDS2018 (intrusion detection data set containing network data along with anomaly labels)
To prevent class imbalance issues, loss for anomalous samples is weighted higher than for normal samples
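to make the weighting idea concrete, here's a minimal sketch, assuming the alarm network is trained with plain binary cross-entropy. the function name and the `anomaly_weight` value are my own placeholders, not taken from the paper.

```python
import math

def weighted_bce(y_true, y_prob, anomaly_weight=5.0, eps=1e-12):
    """Mean binary cross-entropy where anomalous samples (label 1) count more.

    anomaly_weight is a hypothetical value chosen for illustration only.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        if y == 1:
            total += -anomaly_weight * math.log(p)   # missed anomaly: up-weighted
        else:
            total += -math.log(1.0 - p)              # false alarm: weight 1
    return total / len(y_true)
```

with this, missing an anomaly at a given confidence costs `anomaly_weight` times more than an equally confident false alarm, which is what keeps the rare class from being ignored.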
Subset-scanning on internal activations:
Borrowed from anomalous-pattern detection: scan for the most anomalous subset of activations within the autoencoder (AE)
Uses Non-Parametric Scan Statistics (NPSS) combined with the Linear Time Subset Scanning (LTSS) property to efficiently detect anomalies in hidden layers
Detects anomalies by comparing test activations to a “clean” background distribution, computing p-values, and identifying subsets with unusually high activation deviations
Complementary pixel-space subset scanning:
Applies the same technique to reconstruction error: identifying groups of pixels that are distorted more than expected, which aids in interpretability
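a rough sketch of the scan itself, assuming one p-value per node and the Berk-Jones statistic as the NPSS scoring function (a common choice, but an assumption on my part). the function names and the alpha grid are hypothetical.

```python
import math

def berk_jones(n_alpha, n, alpha):
    """Berk-Jones score n * KL(n_alpha/n || alpha); 0 if there is no excess of small p-values."""
    rate = n_alpha / n
    if rate <= alpha:
        return 0.0
    score = n_alpha * math.log(rate / alpha)
    if n_alpha < n:  # second KL term vanishes when every p-value is significant
        score += (n - n_alpha) * math.log((1.0 - rate) / (1.0 - alpha))
    return score

def scan_activations(pvalues, alphas=(0.01, 0.05, 0.1, 0.2)):
    """Return (best_score, sorted indices of the highest-scoring subset of nodes)."""
    # LTSS: with nodes ordered by priority, the best subset is one of the prefixes,
    # so we score n prefixes per alpha instead of all 2^n subsets.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    best_score, best_subset = 0.0, []
    for alpha in alphas:
        n_alpha = 0
        for k, idx in enumerate(order, start=1):
            if pvalues[idx] <= alpha:
                n_alpha += 1
            score = berk_jones(n_alpha, k, alpha)
            if score > best_score:
                best_score, best_subset = score, sorted(order[:k])
    return best_score, best_subset
```

the same function works on pixel-level p-values from reconstruction error, since it only ever sees a list of p-values.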
Weakly Supervised Detection of Hallucinations in LLM Activations (Dec 2023)
weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models
goal is to determine if a pre-trained LLM has internalized harmful anomalous patterns (e.g., hallucinations) by examining its internal states (node activations)
approach only requires access to samples labeled as “normal” (true)
Scanning approach:
Nodes: individual activation units in one or multiple layers (e.g., transformer encoder/decoder layers)
For each activation unit (node) j, compare its activation on a test sentence to the empirical distribution from the reference dataset.
Calculate an empirical p-value indicating how extreme the activation is relative to reference activations
Sentences: scan across a batch of test sentences to find clusters of anomalous activations in specific sentences
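the per-node p-value step is easy to sketch. this assumes a one-sided (right-tail) p-value with +1 smoothing, which is my simplification; the helper names are hypothetical and the paper may define the p-values differently.

```python
from bisect import bisect_left

def empirical_pvalue(reference, value):
    """Fraction of reference activations >= value, with +1 smoothing so it is never 0."""
    ref = sorted(reference)
    n_ge = len(ref) - bisect_left(ref, value)
    return (n_ge + 1) / (len(ref) + 1)

def node_pvalues(reference_acts, test_acts):
    """reference_acts: one row of node activations per reference sentence.
    test_acts: node activations for a single test sentence.
    Returns one empirical p-value per node."""
    columns = list(zip(*reference_acts))  # regroup activations by node
    return [empirical_pvalue(col, a) for col, a in zip(columns, test_acts)]
```

small p-values mean the test activation is more extreme than almost everything seen on the "normal" reference set; these are what the scan then searches over.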
Deep Semi-Supervised Anomaly Detection (Feb 2020)
information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution
DA3G: Detecting Adversarial Attacks by Analysing Gradients
a general end-to-end method to detect adversarial examples based on the analysis of neural networks’ gradients
target-alarm structure
Obfuscated Activations Bypass LLM Latent-Space Defenses (Feb 2025)
Multiresolution-Fractional Brownian motion (fBm) model and deep multiresolution analysis combined
estimate stochastic deep multiresolution fractal texture features for tumor tissues in brain MRI images
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
a geometric & probabilistic framework for unsupervised mechanistic anomaly detection in deep neural networks, geared towards adversarial attack mitigation
FACADE elucidates circuit contributions to the properties of high-dimensional activation modes, aiding in adversarial attack identification, seen as probabilistic outliers in geometric transformations
Steps:
probabilistic Dirichlet Process Mixture model for unsupervised clustering (DP-Means) to identify “pseudoclass” modes in intermediate activation space for a given density threshold λ
find circuits responsible for pseudoclass formation and propagation through causal discovery and Automatic Circuit DisCovery (ACDC)
determine manifold and kernel density properties of pseudoclass propagation through circuits and in relation to final classes through mean-field theoretic approximation
generate a distribution over circuits as they contribute to changes in manifold properties of pseudoclasses as they propagate through the network, e.g. effective reduction in radius or dimension
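for the first step, here's a minimal sketch of generic DP-Means clustering (not FACADE's actual code). it assumes squared Euclidean distance, and `lam` plays the role of the threshold λ. the function name and `n_iters` are placeholders.

```python
def dp_means(points, lam, n_iters=10):
    """k-means-like clustering that opens a new cluster whenever a point's squared
    distance to every existing center exceeds lam. Returns (centers, labels)."""
    dim = len(points[0])
    # start from a single cluster at the global mean
    centers = [[sum(p[d] for p in points) / len(points) for d in range(dim)]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
            if dists[j] > lam:          # too far from everything: new "pseudoclass"
                centers.append(list(p))
                j = len(centers) - 1
            labels[i] = j
        # recompute centers as member means, dropping clusters that ended up empty
        new_centers, remap = [], {}
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                remap[j] = len(new_centers)
                new_centers.append([sum(m[d] for m in members) / len(members)
                                    for d in range(dim)])
        labels = [remap[l] for l in labels]
        centers = new_centers
    return centers, labels
```

a smaller `lam` gives more, tighter modes; a very large `lam` collapses everything into one cluster, so the number of pseudoclasses is set by the threshold rather than chosen up front.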
Examining properties of individual neurons: [2502.06809] Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Mechanistic Anomaly Detection
https://github.com/ejnnr/cupbearer - library for it
[2504.08812] Mechanistic Anomaly Detection for "Quirky" Language Models
train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models
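a toy version of the "flag points that differ from training" idea, assuming a per-feature standardized distance (Mahalanobis distance with a diagonal covariance) as the score. this is not cupbearer's API or the paper's exact detectors; the class name and the threshold rule are made up for illustration.

```python
from statistics import mean, pstdev

class DiagonalGaussianDetector:
    """Scores a feature vector by its squared, per-feature standardized distance
    from the training data."""

    def fit(self, train_features, eps=1e-6):
        columns = list(zip(*train_features))
        self.means = [mean(col) for col in columns]
        self.stds = [pstdev(col) + eps for col in columns]  # eps avoids divide-by-zero
        # hypothetical rule: flag anything scoring above the worst training point
        self.threshold = max(self.score(row) for row in train_features)
        return self

    def score(self, features):
        return sum(((x - m) / s) ** 2
                   for x, m, s in zip(features, self.means, self.stds))

    def is_anomalous(self, features):
        return self.score(features) > self.threshold
```

usage would be something like `DiagonalGaussianDetector().fit(train_acts).is_anomalous(test_act)`, where the features are whatever you extract from the model (activations, probe outputs, etc.).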