In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Layerwise Sequential Unit Variance (LSUV)

An initialization technique for deep architectures

All You Need Is a Good Init

From the abstract:

Layer-sequential unit-variance (LSUV) initialization – a simple method for weight initialization for deep net learning – is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.
In [2]:
from exp.nb_08 import *

Data, Model and Runner

In [3]:
x_train, y_train, x_valid, y_valid = get_data()

train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)

nh,bs = 50,512
c = y_train.max().item()+1
loss_func = F.cross_entropy

data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)
In [4]:
In [5]:
mnist_view = view_tfm(1,28,28)
In [6]:
callbacks = [
    partial(AvgStatsCallback, accuracy),
    partial(BatchTransformXCallback, mnist_view),
]
In [7]:
nfs = [8,16,32,64,64]

Let's refactor our conv_layer into a ConvLayer class with two new properties, bias and weight, which make it easier to apply the LSUV algorithm:

In [8]:
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
        super().__init__()  # required before registering submodules
        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
        self.relu = GeneralRelu(sub=sub, **kwargs)
    def forward(self, x):
        return self.relu(self.conv(x))
    # `bias` is the shift applied after the activation (the negated `sub`)
    @property
    def bias(self): return -self.relu.sub
    @bias.setter
    def bias(self, v): self.relu.sub = -v
    @property
    def weight(self): return self.conv.weight
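Exposing bias as a property is what lets an in-place update such as `m.bias -= h.mean` reach the activation's `sub` value. The sketch below shows the same getter/setter pattern in plain Python; `Activation` and `Layer` are hypothetical stand-ins for illustration, not the notebook's classes.

```python
class Activation:
    """Hypothetical stand-in for an activation that subtracts `sub` from its output."""
    def __init__(self, sub=0.0):
        self.sub = sub


class Layer:
    """Hypothetical layer exposing the activation shift as a `bias` property."""
    def __init__(self):
        self.act = Activation()

    @property
    def bias(self):
        # Subtracting `sub` is the same as adding a bias of -sub.
        return -self.act.sub

    @bias.setter
    def bias(self, v):
        self.act.sub = -v


layer = Layer()
# `-=` first reads the getter, then passes the new value to the setter.
layer.bias -= 0.25
print(layer.bias, layer.act.sub)  # -0.25 0.25
```

Without the setter, the augmented assignment would raise an AttributeError, because a read-only property cannot be assigned to.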
In [9]:
learn, run = get_learn_run(data, nfs, ConvLayer, 0.6, cbs=callbacks)

Here is our baseline without initializing the weights with LSUV:

In [10]:
run.fit(2, learn)
train: [1.93039625, tensor(0.3342, device='cuda:0')]
valid: [0.90568271484375, tensor(0.6834, device='cuda:0')]
train: [0.4223537890625, tensor(0.8611, device='cuda:0')]
valid: [0.218360595703125, tensor(0.9318, device='cuda:0')]

The process looks like this:

  • Initialize the neural net with the usual technique, then pass a batch through the model and check the outputs of the linear and convolutional layers.
  • Rescale the weights according to the variance actually observed in the activations, and subtract the observed mean from the initial bias. That way the activations stay normalized.
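The two steps above can be sketched without any deep learning library. This is a minimal, dependency-free illustration on a single toy layer computing `relu(w @ x) - sub`, not the notebook's implementation; the data, the weight init, and the names `xs`, `w`, `sub`, and `layer_stats` are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)

n_in, n_out, n_samples = 20, 10, 200
# Hypothetical toy batch and weights; any "usual" init would do here.
xs = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_samples)]
w = [[random.gauss(0, 1) / n_in ** 0.5 for _ in range(n_in)] for _ in range(n_out)]
sub = 0.0  # amount subtracted after the ReLU


def layer_stats(w, sub):
    """Pass the batch through relu(w @ x) - sub; return (mean, std) over all outputs."""
    outs = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) - sub
            for x in xs for row in w]
    return statistics.fmean(outs), statistics.pstdev(outs)


mean, std = layer_stats(w, sub)

# Step 1: shift the output until its mean is close to zero.
while abs(mean) > 1e-3:
    sub += mean
    mean, std = layer_stats(w, sub)

# Step 2: rescale the weights until the output std is close to one.
while abs(std - 1) > 1e-3:
    w = [[wi / std for wi in row] for row in w]
    mean, std = layer_stats(w, sub)

print(f"mean={mean:.4f} std={std:.4f}")
```

Because the weights are rescaled after the mean has been centered, the final mean in this sketch is no longer zero, while the standard deviation is one.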
In [11]:
learn, run = get_learn_run(data, nfs, ConvLayer, 0.4, cbs=callbacks)

A helper function that grabs one batch from the dataloader and runs the registered callbacks on it.

In [12]:
def get_batch(dl, run):
    run.xb, run.yb = next(iter(dl))
    for cb in run.cbs: cb.set_runner(run)
    run('begin_batch')  # let the callbacks transform the batch
    return run.xb, run.yb
In [13]:
xb, yb = get_batch(data.train_dl, run)

Next, we walk the model recursively and pick out the layers to initialize: the linear and convolutional layers, not the ReLU or adaptive pooling layers.

In [14]:
def find_mods(m, func):
    if func(m): return [m]
    return sum([find_mods(o, func) for o in m.children()], [])

def is_lin_layer(l):
    # ReLU is an activation, not a linear layer, so it is not listed here
    layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)
    return isinstance(l, layers)
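The recursive search stops descending as soon as a module matches, and `sum(..., [])` flattens the per-child result lists into one. The sketch below runs the same function on a hypothetical `Node` class that mimics only the `children()` method of a module tree; the node names are made up for illustration.

```python
class Node:
    """Hypothetical stand-in for a module: it has a name and child nodes."""
    def __init__(self, name, *kids):
        self.name, self.kids = name, list(kids)

    def children(self):
        return iter(self.kids)


def find_mods(m, func):
    # A matching node is returned as-is; its own children are not searched.
    if func(m): return [m]
    # Otherwise collect matches from every child and flatten the lists.
    return sum([find_mods(o, func) for o in m.children()], [])


tree = Node("model",
            Node("block1", Node("conv_a"), Node("relu_a")),
            Node("block2", Node("conv_b")),
            Node("pool"))

found = find_mods(tree, lambda n: n.name.startswith("conv"))
print([n.name for n in found])  # ['conv_a', 'conv_b']
```

Stopping at the first match is why searching for ConvLayer instances returns the wrapper modules themselves rather than the Conv2d layers inside them.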
In [15]:
mods = find_mods(learn.model, lambda o: isinstance(o, ConvLayer))
In [16]:
mods
[ConvLayer(
   (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 )]
In [17]:
def lsuv_append_stat(hook, mod, inp, outp):
    d = outp.data  # activations produced by the hooked module
    hook.mean, hook.std = d.mean().item(), d.std().item()
In [18]:
mdl = learn.model.cuda()
In [20]:
with Hooks(mods, lsuv_append_stat) as hooks:
    mdl(xb)  # forward pass so that each hook records its layer's statistics
    for hook in hooks: print(hook.mean, hook.std)
0.07416391372680664 0.2242083102464676
0.0548112615942955 0.1576429158449173
0.05677511915564537 0.11721718311309814
0.07732177525758743 0.11247489601373672
0.042910635471343994 0.06787385791540146
In [23]:
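def lsuv_module(m, xb):
    h = Hook(m, lsuv_append_stat)
    # Each `mdl(xb)` call reruns the forward pass, which refreshes h.mean and h.std
    while mdl(xb) is not None and abs(h.mean ) > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    h.remove()  # detach the hook once the layer is normalized
    return h.mean, h.std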

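The means are still slightly above zero, since the second loop rescales the weights after the mean has been centered, but the standard deviation is one all the way through.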

In [24]:
for m in mods: print(lsuv_module(m, xb))
(0.25661736726760864, 1.0)
(0.11878962069749832, 0.9999999403953552)
(0.15285749733448029, 0.9999998211860657)
(0.0946943610906601, 1.0)
(0.2778244614601135, 1.0)
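The nonzero means follow from the order of the two loops. As a rough sketch, assume the convolution bias is zero and the activation is a plain or leaky ReLU (both are assumptions; the notebook does not show how the model is built or initialized). Let \mu and \sigma be the mean and standard deviation of a layer's output before LSUV. The first loop subtracts \mu. The second divides the weights by \sigma, which scales the ReLU output by 1/\sigma but leaves the subtracted shift untouched:

```latex
\[
\text{mean after LSUV}
  \;=\; \frac{\mu}{\sigma} - \mu
  \;=\; \mu\left(\frac{1}{\sigma} - 1\right)
\]
```

For the first layer, the statistics printed before LSUV (\mu ≈ 0.07416, \sigma ≈ 0.22421) give 0.07416 × (1/0.22421 − 1) ≈ 0.2566, which agrees with the first mean printed above. The later layers cannot be checked this way from the earlier printout, because their inputs change once the preceding layers have been rescaled.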

And it works. After two epochs, validation accuracy rises from about 93% for the baseline to about 96.6%:

In [25]:
%time run.fit(2, learn)
train: [0.5241593359375, tensor(0.8312, device='cuda:0')]
valid: [0.15398057861328124, tensor(0.9529, device='cuda:0')]
train: [0.150899931640625, tensor(0.9526, device='cuda:0')]
valid: [0.1112196044921875, tensor(0.9657, device='cuda:0')]
Wall time: 2.55 s