Memory-Distilled Selection for Noise-Robust Anomaly Detection

Sirojbek Safarov^*, Jaewoo Park^*†‡, Yoon Gyo Jung^*, Kuan-Chuan Peng, Wonchul Kim, Seongdeok Bang, Octavia Camps

AIVEX, Northeastern University, Mitsubishi Electric Research Laboratories
ICML 2026
^*Equal Contribution ^†Project lead ^‡Corresponding author

The MeDS pipeline, in three steps.

Abstract

Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.

Motivation

Modern factories rely on vision systems to spot defective products automatically. The standard recipe is unsupervised anomaly detection: train a model on images that are all assumed to be normal, then flag anything that deviates at inspection time. This works well — as long as the training set really is clean.

In practice, it rarely is. Curating and verifying large industrial datasets is expensive, and even careful manual screening lets a non-negligible fraction of defective images slip in unlabeled. Once this contamination enters training, the model starts treating defects as “normal,” and accuracy drops sharply as the noise ratio grows.

Existing noise-robust methods only cope partially: they tend to rely on strong assumptions — knowing the contamination ratio in advance, or how anomalous samples are distributed — so their best settings shift from one dataset and noise level to the next. That means fragile, per-setting hyperparameter tuning and a steep drop-off under heavy contamination. To our knowledge, no prior method stays accurate at both the image level (is this product defective?) and the pixel level (where is the defect?) across a wide range of noise ratios.

MeDS targets exactly this gap. It is a noise-robust training algorithm built around data selection: instead of assuming a clean dataset or a known contamination level, it learns to separate normal from contaminated samples directly from the noisy data, and then trains only on the part it trusts. In doing so it solves three problems that trip up prior methods:

No noise-ratio-specific tuning. A single fixed configuration works from perfectly clean data all the way up to 40% contamination — there is no need to know, estimate, or tune for the noise ratio.
Robustness at both levels. It stays accurate at the image and pixel level at once, instead of trading one off against the other as earlier methods do.
Strong performance under heavy noise. On MVTecAD it holds 99.16% image-level AUROC even at a 40% noise ratio, and it sets a new state of the art on VisA and Real-IAD under contamination.

As a bonus, the very scores that drive this selection can also rank a dataset from most- to least-suspicious — turning MeDS into a practical tool for cleaning contaminated datasets with far less human effort.

Problem Setup

Anomaly detection with noisy data. To formalize the setting above: unlike conventional anomaly detection, whose training set is assumed clean, here the training set

\[ \mathcal{D} \;=\; \mathcal{D}_{\text{normal}} \,\cup\, \mathcal{D}_{\text{anomaly}} \]

mixes normal images with a set of unlabeled anomalous images \(\mathcal{D}_{\text{anomaly}}\) — we are never told which image is which. The level of contamination is the noise ratio

\[ \text{noise ratio} \;=\; \frac{\lvert \mathcal{D}_{\text{anomaly}} \rvert}{\lvert \mathcal{D} \rvert}. \]

At test time, for an input image \(x \in \mathcal{X}\) the model produces a pixel-level anomaly score map \(s(x) \in \mathbb{R}^{H \times W}\), from which a single image-level score is obtained by spatial max-pooling:

\[ S(x) \;=\; \max_{h,w}\; s(x)_{hw}. \]

A good detector scores anomalies above normal samples — ideally \(S(x_{\text{anom}}) > S(x_{\text{norm}})\) at the image level, with the map \(s(x)\) highlighting the defective regions at the pixel level. The core challenge is to learn such a scorer using only the unlabeled, contaminated set \(\mathcal{D}\) and without knowing the noise ratio — so that it stays reliable across the full range from clean data to heavy contamination.
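As a concrete reading of the two definitions above, the following minimal sketch computes a noise ratio and pools a pixel-level score map into an image-level score. The function names and the tiny score map are hypothetical and only serve as illustration.

```python
def noise_ratio(num_anomalous, num_total):
    """Fraction of training images that are (unlabeled) anomalies."""
    return num_anomalous / num_total


def image_score(score_map):
    """Image-level score S(x): the maximum of the pixel-level map s(x)."""
    return max(max(row) for row in score_map)


# Hypothetical 3x4 anomaly score map with a single "hot" pixel.
toy_map = [
    [0.10, 0.12, 0.09, 0.11],
    [0.08, 0.95, 0.14, 0.10],
    [0.11, 0.13, 0.10, 0.09],
]

print(noise_ratio(40, 200))   # 40 anomalous images out of 200
print(image_score(toy_map))   # driven entirely by the hot pixel
```

Because the image-level score is a maximum, one strongly anomalous pixel is enough to flag the whole image.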

Preliminaries

MeDS combines two standard anomaly-detection mechanisms with one well-known property of how neural networks learn.

1. Memory-based AD. A memory bank \(\mathcal{M}\) stores patch-level features of the training images; a patch is then scored by its distance to the nearest stored feature, so unfamiliar (likely anomalous) patches score high. With a frozen, pretrained encoder \(g\) (e.g. an ImageNet-pretrained backbone) that maps an image \(x\) to a feature map \(g(x) \in \mathbb{R}^{H \times W \times C}\), the memory anomaly score at spatial position \((h,w)\) is

\[ s_{\mathcal{M}}(x)_{hw} \;=\; \min_{z \in \mathcal{M}} D\!\left(z,\, g(x)_{hw}\right), \]

where \(D\) is the Euclidean distance in feature space. The memory is sampled from the pool of all training patch features,

\[ g(\mathcal{D}) \;:=\; \bigl\{\, g(x)_{hw} \;\bigm|\; x \in \mathcal{D},\; (h,w) \in [H]\times[W] \,\bigr\}, \]

typically via coreset subsampling — a compact subset chosen to approximate the full feature set. Because the encoder \(g\) is frozen, this score is robust to noise but only coarse.
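The memory score can be sketched in a few lines of plain Python. This is an illustrative toy, not the released implementation: the features are hypothetical 2-D tuples instead of encoder outputs, and the nearest-neighbor search is brute force.

```python
import math


def memory_score(patch, memory):
    """Euclidean distance from one patch feature to its nearest stored feature."""
    return min(math.dist(z, patch) for z in memory)


def memory_score_map(feature_map, memory):
    """Apply memory_score at every spatial position of an H x W feature map."""
    return [[memory_score(patch, memory) for patch in row] for row in feature_map]


# Hypothetical 2-D patch features; a real encoder would output C-dimensional ones.
memory = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
feature_map = [
    [(0.0, 0.1), (1.0, 0.0)],
    [(0.1, 1.0), (4.0, 3.0)],  # the last patch is far from everything stored
]

for row in memory_score_map(feature_map, memory):
    print([round(v, 3) for v in row])
```

Only the bottom-right patch receives a large score, since it is the only one without a close neighbor in the memory.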

2. Teacher–student (reconstruction) AD. Here the anomaly score is the disagreement between the frozen teacher \(g\) and a trainable student \(f_\theta\) — the reconstruction score

\[ s_{\theta}(x)_{hw} \;=\; D\!\left(g(x)_{hw},\, f_\theta(x)_{hw}\right), \]

and the student is trained to minimize it over the training set,

\[ \min_{\theta}\; \frac{1}{\lvert \mathcal{D} \rvert} \sum_{x \in \mathcal{D}} s_\theta(x). \]

The student is deliberately limited in capacity (e.g. a bottleneck decoder), so it reconstructs in-distribution (normal) patterns well and out-of-distribution (anomalous) ones poorly — making anomalies stand out.

3. The early-learning bias. MeDS also leans on a well-documented property of neural-network training: when the targets are noisy, a network first fits the dominant, consistent patterns and only later memorizes the rare, inconsistent ones. Because normal patches greatly outnumber anomalous ones, a model trained on contaminated data drives the scores of normal patches down earlier and more strongly than those of anomalies — a built-in denoising effect.

Putting it together, MeDS uses the memory score as a noise-robust starting signal, distills it into the reconstruction model (exploiting the early-learning bias), and then fine-tunes that model on the data it trusts — the three steps described next.

How MeDS works

MeDS reaches that robustness with a three-step recipe: (1) get a rough but noise-resistant read on which samples look suspicious, (2) sharpen that signal by distilling it into a small trained network, and (3) repeatedly retrain on the cleanest samples it can find — progressively earning precise, pixel-level defect localization. The three steps are detailed below.

Step 1: Memory Score Construction

The idea. A memory bank stores little snapshots (patch features) of the training images. To score a new patch, we measure how far it is from its nearest neighbor in the memory — far away means “unfamiliar,” which usually means a defect. The catch under noisy data: if defective patches sneak into the memory, defects start to look normal and the detector is fooled.

Our fix. Instead of one big memory, we build an ensemble of \(B\) small memories, each from a different random \(10\%\) subsample of the features, and average their scores. Because each memory is sparse, it only reliably captures the patterns that show up often — the normal ones — and “blurs out” rare defect patterns. In short, sparse subsampling acts like a low-pass filter: defects end up far from every memory and stand out, whether contamination is light or heavy.

\[ S_{\mathbb{M}}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} s_{\mathcal{M}_b}(x) \]
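A minimal sketch of the bootstrapped ensemble, under stated assumptions: the 2-D Gaussian toy features, the seeds, and the helper names are hypothetical, and each memory is a plain uniform subsample drawn without replacement. Only the \(10\%\) ratio and the averaging over \(B\) memories come from the description above.

```python
import math
import random


def nearest_distance(patch, memory):
    return min(math.dist(z, patch) for z in memory)


def build_memories(pool, num_memories, ratio, seed=0):
    """Draw num_memories random subsamples, each holding ratio * |pool| features."""
    rng = random.Random(seed)
    size = max(1, round(ratio * len(pool)))
    return [rng.sample(pool, size) for _ in range(num_memories)]


def ensemble_score(patch, memories):
    """Average nearest-neighbor distance over all partial memories."""
    return sum(nearest_distance(patch, m) for m in memories) / len(memories)


# Hypothetical contaminated pool: 95 normal features near the origin plus
# 5 anomalous features clustered far away (5% contamination).
rng = random.Random(1)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
anomalous = [(8 + rng.gauss(0, 0.3), 8 + rng.gauss(0, 0.3)) for _ in range(5)]
pool = normal + anomalous

memories = build_memories(pool, num_memories=100, ratio=0.1)

print(ensemble_score((0.0, 0.0), memories))  # query that looks normal
print(ensemble_score((8.0, 8.0), memories))  # query that looks anomalous
```

In this toy, many of the small memories contain no anomalous feature at all, so the anomalous query stays far from them on average even though anomalies are present in the pool.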

Why there is a sweet spot. Make the memory too large and defects leak in and look normal; too small and even ordinary patches look unfamiliar. Theorem 1 makes this precise.

Theorem 1. Under a regularity condition, for any anomaly and normal patch features \(q_{\text{anom}}\) and \(q_{\text{norm}}\), the expected gap

\[ \Delta(m) \;:=\; \mathbb{E}\!\left[D(q_{\text{anom}}, \mathcal{M})\right] - \mathbb{E}\!\left[D(q_{\text{norm}}, \mathcal{M})\right] \]

satisfies \(\Delta(m) > 0\) and decomposes into a second-order Taylor approximation \(\Delta_0(m)\) plus a remainder \(\epsilon_0(m)\),

\[ \Delta(m) \;=\; \Delta_0(m) + \epsilon_0(m), \qquad \Delta_0(m) \;=\; \int_0^{\infty} \delta(r)\,\omega(m, r)\, \mathrm{d}r, \]

where \(\delta\) is non-negative and the weight function \(\omega(m, r)\) is unimodal with respect to \(m\). Here \(D(q, \mathcal{M}) := \min_{z \in \mathcal{M}} D(z, q)\) is the distance from \(q\) to its nearest neighbor in the memory, and the expectation \(\mathbb{E}\) is taken over the memory \(\mathcal{M}\) randomly subsampled from the extracted feature set \(g(\mathcal{D})\) under the constraint \(\lvert \mathcal{M} \rvert = m\).

In words: the normal–anomaly gap \(\Delta(m)\) is always positive, and because the weight \(\omega\) is unimodal in \(m\), its leading term \(\Delta_0(m)\) is largest at a moderate memory size. So there is an appropriate subsampling ratio that drives normal and abnormal features as far apart as possible.
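For a finite pool, the expectation in \(\Delta(m)\) can be computed exactly instead of by sampling. If a memory of size \(m\) is drawn uniformly without replacement from \(n\) features, the \(i\)-th closest feature to a query is its nearest neighbor in the memory in \(\binom{n-i}{m-1}\) of the \(\binom{n}{m}\) possible memories. The sketch below uses this identity to sweep the gap over \(m\). The toy geometry is hypothetical and not taken from the paper, so the resulting curve only illustrates the mechanics and is not evidence for the theorem.

```python
import math


def expected_nn_distance(distances, m):
    """Exact E[nearest-neighbor distance] when m of the n pool points are drawn
    uniformly without replacement; `distances` are query-to-pool distances."""
    d = sorted(distances)
    n = len(d)
    total = math.comb(n, m)
    # The i-th closest point (1-based) is the nearest one in the memory exactly
    # when it is drawn and none of the i-1 closer points are.
    return sum(d[i - 1] * math.comb(n - i, m - 1) for i in range(1, n + 1)) / total


# Hypothetical toy geometry: 200 normal features on a dense line and
# 10 sparser anomalous features on a parallel line two units away.
normal_pool = [(0.1 * i, 0.0) for i in range(200)]
anomaly_pool = [(5.0 + j, 2.0) for j in range(10)]
pool = normal_pool + anomaly_pool

q_norm, q_anom = (9.95, 0.0), (9.5, 2.0)
d_norm = [math.dist(q_norm, z) for z in pool]
d_anom = [math.dist(q_anom, z) for z in pool]

for m in (1, 5, 21, 50, 100, 210):
    gap = expected_nn_distance(d_anom, m) - expected_nn_distance(d_norm, m)
    print(f"m={m:3d}  gap={gap:.3f}")
```

Sweeping `m` this way shows how the gap responds to memory size for a given pool; where it peaks depends on how densely normal and anomalous features are packed around the two queries.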

Normal vs. anomaly score separation and AUROC across subsampling ratios at 10/20/40% noise

Theorem 1 in practice (at \(10\%/20\%/40\%\) noise): the gap between normal (blue) and anomaly (red) scores — and the resulting AUROC (green) — peaks at a small-to-moderate subsampling ratio, exactly the predicted sweet spot.

One limitation: these features come from a frozen, general-purpose encoder, so this read is only coarse — good enough to flag which images are contaminated, but not yet precise enough to outline a defect pixel by pixel. Steps 2 and 3 fix that.

Step 2: Distillation

The idea. The Step 1 scores are robust but coarse, because they rely on a frozen encoder. So we train a small student network (a reconstruction model \(s_\theta\)) to reproduce the memory scores — but now learning directly from the factory images, so it picks up domain-specific detail the frozen encoder missed and amplifies the gap between normal and defective patches.

Why it doesn’t just memorize the defects too. This is where the early-learning bias (see Preliminaries) pays off: the student fits the majority — normal — patterns first, so it lowers their scores earlier and more than the rare defective ones, effectively denoising the memory signal for free.

The catch. Train too long and the network eventually memorizes the defects as well, hurting fine pixel-level accuracy; stop too early and it is under-trained for precise localization (see the accompanying graph). Distillation alone can’t have it both ways, which is exactly what Step 3 resolves.

Step 3: Progressive Data Selection & Fine-tuning

The idea. Use the Step 2 model to pick out the cleanest-looking images and fine-tune only on those. This lets the model keep training — getting sharper at outlining defects — without ever overfitting to the contaminated samples. We alternate (select → train), fine-tuning \(s_\theta\) on the progressively selected clean subset \(\mathcal{S}_t\):

\[ \min_{\theta}\; \frac{1}{|\mathcal{S}_t|} \sum_{x \in \mathcal{S}_t} s_\theta(x) \]

Selection that self-corrects. A sample is kept when its score \(\eta_t(x)\) falls below a threshold \(\tau_t\). Early on, selection trusts the original distilled model and keeps only clearly-normal samples; as training improves, it leans on the current model and admits more samples — a positive feedback loop where a better model yields a cleaner subset, which in turn yields a better model.

\[ \eta_t(x) \;=\; (1 - \alpha_t)\, \max_{h, w} s_{\theta_0}(x)_{hw} \;+\; \alpha_t\, \max_{h, w} s_{\theta}(x)_{hw}, \qquad \alpha_t = \min\!\left(1,\; \tfrac{2t}{T}\right) \]

\[ \tau_t \;=\; \operatorname{Median}\!\left(\eta_t(x)\right) + k_t \cdot \operatorname{MAD}\!\left(\eta_t(x)\right), \qquad k_t = k\!\left(\tfrac{t}{T}\right) \]

Here \(\max_{h,w} s(x)_{hw}\) is the image-level score, \(s_{\theta_0}\) is the fixed initial (distilled) model and \(s_\theta\) the current one, \(\alpha_t\) gradually shifts trust from the initial to the current model over the \(T\) iterations, and the threshold \(\tau_t\) — built from the robust median and \(\operatorname{MAD}\) (median absolute deviation) — loosens as \(k_t\) grows. The payoff: precise pixel-level localization that stays robust at any contamination level, with no noise-specific tuning.
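The selection rule can be sketched directly from the two formulas above. This is a minimal illustration under assumptions: the scores are hypothetical image-level values, and \(k_t\) is passed in as a plain number because the exact form of the schedule \(k(\cdot)\) is not spelled out here.

```python
from statistics import median


def blend_weight(t, T):
    """alpha_t = min(1, 2t / T): trust shifts fully to the current model by t = T / 2."""
    return min(1.0, 2 * t / T)


def selection_scores(init_scores, current_scores, t, T):
    """eta_t(x): convex blend of the initial and current image-level scores."""
    a = blend_weight(t, T)
    return [(1 - a) * s0 + a * s for s0, s in zip(init_scores, current_scores)]


def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def select_clean(init_scores, current_scores, t, T, k_t):
    """Indices of samples whose eta_t falls below tau_t = median + k_t * MAD."""
    eta = selection_scores(init_scores, current_scores, t, T)
    tau = median(eta) + k_t * mad(eta)
    return [i for i, e in enumerate(eta) if e < tau]


# Hypothetical image-level scores for five training images; the last one is suspicious.
init_scores = [0.10, 0.12, 0.11, 0.13, 0.90]
current_scores = [0.08, 0.10, 0.09, 0.12, 0.95]

print(select_clean(init_scores, current_scores, t=0, T=10, k_t=3.0))
print(select_clean(init_scores, current_scores, t=5, T=10, k_t=3.0))
```

Because both the median and the MAD ignore extreme values, the single high-scoring image barely moves the threshold and is excluded in both calls.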

Results

We evaluate MeDS on three industrial benchmarks (MVTecAD, VisA, and Real-IAD) under contamination, following each dataset’s noisy-data protocol (MVTecAD at \(10/20/40\%\) noise, VisA at \(2/5/10\%\), plus the clean \(0\%\) case, and the official noisy setting for Real-IAD). The takeaway: where prior methods degrade as contamination rises, MeDS holds its accuracy, reaching 99.16% image-level AUROC on MVTecAD even at 40% noise, and sets a new state of the art on VisA and Real-IAD under noise. It also works as a plug-in: dropped on top of existing detectors (HVQ, Dinomaly, INP-Former), it boosts their noise robustness with almost no cost on clean data.
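For readers unfamiliar with the headline metric, image-level AUROC is the probability that a randomly chosen anomalous test image receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below computes it by brute force over all pairs; the function name and the toy scores are hypothetical, and a real evaluation would use an optimized library routine.

```python
def auroc(normal_scores, anomaly_scores):
    """Probability that a random anomaly outscores a random normal sample
    (ties count one half). Quadratic in the sample count, fine for a sketch."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))


# Hypothetical image-level scores S(x) on a tiny test set.
print(auroc([0.1, 0.2], [0.3, 0.4]))  # perfect separation
print(auroc([0.1, 0.4], [0.3, 0.2]))  # no better than chance
```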

Table 1 — MVTecAD. Image-level (I-AUROC, I-AP) and pixel-level (P-AUPRO, P-AP) scores at \(0\!-\!40\%\) noise. The top rows are dedicated noise-robust detectors (SoftPatch, InReach, FUN-AD); below, MeDS is added on top of three modern backbones (HVQ, Dinomaly, INP-Former), and bold marks the winner of each baseline-vs-MeDS pair. Adding MeDS improves every backbone under noise — at only marginal cost on clean (\(0\%\)) data — and the gap widens as contamination rises: Dinomaly’s image-level AUROC falls to \(87.4\%\) at \(40\%\) noise, while Dinomaly + MeDS holds 99.16%, with the biggest gains on the hardest pixel-level metric (P-AP).

Table 2: results on VisA across 0-10% noise

Table 2 — VisA. The same comparison at VisA’s noise ratios (\(0/2/5/10\%\)). MeDS once more lifts all three backbones on both image- and pixel-level metrics, with the largest improvements at the highest contamination — reproducing the MVTecAD trend on a second dataset.

Table 3: results on Real-IAD across 0-40% noise

Table 3 — Real-IAD. Evaluated under Real-IAD’s official single-class noisy protocol (a separate model per category, which baselines such as SoftPatch require to scale to its many categories). Dashes mark metrics a baseline does not report (some methods report only a subset of the four image-/pixel-level metrics). Dinomaly + MeDS achieves the best overall balance across image and pixel metrics, and its margin grows with noise — e.g. image-level AUROC of 90.99% vs. \(87.78\%\) for Dinomaly at \(40\%\) noise.

Ablation & Hyperparameter Analysis

Component ablation (Table 4). Adding MeDS’s three stages one at a time, on top of a Dinomaly backbone, shows that each is necessary. Reading the table top to bottom — from the bare baseline to the full method — the metrics improve at every step, and the gains are largest under heavy noise:

The bootstrapped memory ensemble alone sharply improves image-level robustness under noise (e.g. I-AUROC at \(40\%\) noise rises from \(90.45\) to \(92.22\)), but its pixel-level localization stays limited because it relies on frozen features.
Distillation transfers that coarse robustness into a trainable network and refines its features, giving consistent gains across all metrics.
Progressive fine-tuning improves both levels further, with the largest jump on the pixel-level P-AP metric (up to \(68.05\) at \(40\%\) noise) — precise segmentation without overfitting to the contaminated samples.

The last rows isolate the fine-tuning design and confirm both halves matter: using the distilled model for both initialization (“init”) and sample selection (“criteria”) beats using the raw memory ensemble or a randomly-initialized model.

Table 4: component ablation of MeDS on MVTecAD (Dinomaly backbone) across 0-40% noise

Robustness to hyperparameters. A big reason MeDS needs no noise-ratio-specific tuning is that it is insensitive to most of its hyperparameters, and a single fixed subsampling ratio works across every noise level. The figure below sweeps the final model’s image-level AUROC (top row) and pixel-level P-AP (bottom row) against three hyperparameters, with a separate curve per noise ratio (\(0/10/20/40\%\)):

Ensemble size \(B\) (left). Both metrics climb steeply from a single memory and flatten by \(B \approx 50\); larger ensembles mainly stabilize the high-noise curves (the \(40\%\) case gains the most). We use \(B = 100\), comfortably inside the saturated region.
Subsampling ratio \(\rho\) (middle). There is a clear peak around \(\rho = 0.1\): too small wastes data, while too large lets anomalous features leak into the memory and erodes the normal–anomaly gap — most damaging at \(40\%\) noise. Crucially, the same \(\rho = 0.1\) is best at every noise level, which is exactly why no per-noise tuning is needed (and matches the sweet spot predicted by Theorem 1).
Distillation iterations \(T\) (right). Performance rises and then plateaus instead of degrading. This is the payoff of progressive selection: where the distilled-only model of Step 2 would start overfitting to noise, fine-tuning on the self-selected clean subset lets the model train far longer safely — so the exact stopping point no longer matters.

Final-model robustness to ensemble size B, subsampling ratio rho, and iterations T across noise levels

Active Label Correction

Beyond fully-automatic detection, MeDS doubles as a tool for cleaning datasets. By ranking training images from most- to least-suspicious (using its selection scores), it lets a human reviewer find and remove contaminated samples while inspecting far fewer images — instead of checking every single one. Concretely, MeDS pushes contaminated samples higher up the list (better ranking quality, measured by AUPRC) and shrinks the inspection depth — the fraction of the data you must review to catch all the contamination — cutting annotation effort across every noise level.
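Inspection depth, as described above, can be sketched as follows. The function name, the toy scores, and the ground-truth flags are hypothetical; the sketch assumes at least one contaminated sample exists and that a higher score means more suspicious.

```python
def inspection_depth(scores, is_contaminated):
    """Fraction of the dataset a reviewer must inspect, going from most- to
    least-suspicious, before every contaminated sample has been seen.
    Assumes at least one entry of is_contaminated is True."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    last_hit = max(rank for rank, i in enumerate(order, start=1) if is_contaminated[i])
    return last_hit / len(scores)


# Hypothetical selection scores for ten training images; three are contaminated.
scores = [0.10, 0.90, 0.20, 0.80, 0.15, 0.30, 0.05, 0.12, 0.11, 0.25]
flags = [False, True, False, True, False, True, False, False, False, False]

print(inspection_depth(scores, flags))
```

Here the three contaminated images occupy the top three ranks, so a reviewer only needs to look at 30% of the data; a ranker that placed one of them last would force a full pass.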

BibTeX

        
          @inproceedings{safarov2026meds,
            title={{MeDS}: Memory-Distilled Selection for Noise-Robust Anomaly Detection},
            author={Sirojbek Safarov and Jaewoo Park and Yoon Gyo Jung and Kuan-Chuan Peng and Wonchul Kim and Seongdeok Bang and Octavia Camps},
            booktitle={International Conference on Machine Learning (ICML)},
            url={https://arxiv.org/abs/2605.26676},
            year={2026}
          }