No-Propagation Diffusion Transformers (NoPropDT)
A clean PyTorch implementation of a backpropagation-free training algorithm: reaching ~99% accuracy on MNIST using only local, layer-wise denoising targets.
PyTorch
Diffusion
Backprop-Free Learning
MNIST
CIFAR-10
Python 100%
5 ⭐ on GitHub
✅ Result: ~99% validation accuracy on MNIST by epoch 7 and ~76% on CIFAR-10 by epoch 50 —
achieved without a single gradient flowing backward through the network.
The Big Idea
Every modern neural network is trained with backpropagation: computing gradients of a global loss
and pushing them backward through every layer. Backprop is powerful, but it is:
- Biologically implausible: real neurons have no known mechanism for receiving precise error signals propagated backward from later layers
- Memory-intensive: requires storing all intermediate activations for the backward pass
- Hard to parallelize layer-wise: each layer waits for the layer above to finish its backward pass
NoPropDT (from researchers at the University of Oxford) replaces backprop with a stack of
local denoising blocks. Each block learns to denoise a class embedding toward the correct label —
no global gradient, no backward pass through the whole network.
How NoPropDT Works
The intuition maps directly to diffusion models: start with a noisy guess and iteratively denoise it.
- Start with a noisy class embedding (Gaussian noise added to the label embedding)
- A DenoiseBlock (CNN + MLP) processes the image alongside the noisy embedding and predicts a cleaner version (see the sketch after this list)
- This is repeated T times: each block is trained with a local MSE loss only (no chain rule across blocks)
- A final linear classifier reads the fully denoised embedding and predicts the class
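A minimal sketch of what a DenoiseBlock might look like. The layer sizes, the CNN/MLP internals, and the exact way image features are fused with the noisy embedding are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn

class DenoiseBlock(nn.Module):
    """One local block: sees the image plus a noisy class embedding and
    predicts a cleaner embedding. Trained with its own MSE loss only."""

    def __init__(self, embed_dim: int = 64, n_channels: int = 1):
        super().__init__()
        # Small CNN encoder for the image (sizes are illustrative)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(n_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP that fuses image features with the noisy class embedding
        self.mlp = nn.Sequential(
            nn.Linear(64 + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor, z_noisy: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(x)
        # Predict a cleaner version of the class embedding
        return self.mlp(torch.cat([feats, z_noisy], dim=1))
```

Because each block only needs its own output and a fixed target, the blocks can in principle be trained independently of one another.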
Code flow:
main.py
└─→ experiments/run_mnist_dt.py
    ├─→ data/mnist_loader.py       # MNIST DataLoaders
    ├─→ models/no_prop_dt.py       # NoPropDT model + DenoiseBlock
    └─→ trainer/train_nopropdt.py  # Layer-wise local MSE training
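The layer-wise training logic lives in trainer/train_nopropdt.py. Below is a hypothetical sketch of what one training step could look like given the description above; the function name, the `alphas` noise schedule, and the frozen label-embedding table are my assumptions, not necessarily the repo's choices:

```python
import torch
import torch.nn.functional as F

def train_step(blocks, classifier, label_embed, block_opts, clf_opt, alphas, x, y):
    """One layer-wise update: each block minimizes its own MSE loss, so no
    gradient ever crosses a block boundary (only within-block autograd is used)."""
    # Clean class embedding used as the denoising target; the embedding table
    # is assumed fixed/frozen here (an assumption for this sketch).
    u_y = label_embed(y).detach()

    # 1) Local denoising losses: one noised target and one optimizer step per block.
    #    alphas is a sequence of noise-schedule values in (0, 1).
    for t, (block, opt) in enumerate(zip(blocks, block_opts)):
        a = alphas[t]                                            # noise level for step t
        z_noisy = (a ** 0.5) * u_y + ((1 - a) ** 0.5) * torch.randn_like(u_y)
        loss = F.mse_loss(block(x, z_noisy), u_y)                # purely local objective
        opt.zero_grad()
        loss.backward()                                          # gradients stay inside this block
        opt.step()

    # 2) Final linear classifier trained on the chained, denoised embedding
    with torch.no_grad():                                        # nothing flows back into the blocks
        z = torch.randn_like(u_y)
        for block in blocks:
            z = block(x, z)
    clf_loss = F.cross_entropy(classifier(z), y)
    clf_opt.zero_grad()
    clf_loss.backward()
    clf_opt.step()
    return clf_loss.item()
```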
Results
| Dataset  | Variant                | Accuracy | Epochs to convergence |
|----------|------------------------|----------|-----------------------|
| MNIST    | With nonlinear decoder | ~99%     | 7                     |
| MNIST    | No decoder             | ~97%     | 10                    |
| CIFAR-10 | With nonlinear decoder | ~76%     | 50                    |
| CIFAR-10 | No decoder             | ~68%     | 50                    |
All results achieved without backpropagation. Training uses only local per-layer MSE losses.
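At evaluation time, classification just chains the trained blocks: start from noise, denoise T times, and read the class off the final embedding. A minimal sketch, with illustrative names:

```python
import torch

@torch.no_grad()
def predict(blocks, classifier, x, embed_dim=64):
    """Inference: start from pure noise, run the T denoising blocks in
    sequence, then classify the fully denoised embedding."""
    z = torch.randn(x.size(0), embed_dim, device=x.device)  # noisy initial embedding
    for block in blocks:                                     # T sequential denoising steps
        z = block(x, z)
    return classifier(z).argmax(dim=1)                       # predicted class indices
```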
Why I Built This
This was a deep-dive reproduction exercise: reading a research paper (Oxford, 2025) and rebuilding it
from scratch in a clean, modular PyTorch codebase, extending it to CIFAR-10 and adding a nonlinear decoder
variant. The goal was to deeply understand an alternative training paradigm that could matter for
neuromorphic hardware, federated learning, and biologically plausible AI.
The repo has earned 5 ⭐ and is actively watched by researchers interested in backprop-free learning.