In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Layerwise Sequential Unit Variance (LSUV)

An initialization technique for deep architectures

All You need is a Good Init

From the abstract:

Layer-sequential unit-variance (LSUV) initialization – a simple method for weight initialization for deep net learning – is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.
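The rest of this notebook implements the second step (plus centering the activation means); the orthonormal pre-initialization from the first step isn't implemented here. As a rough sketch of what that first step could look like in PyTorch (the orthonormal_init helper is made up for illustration):

import torch.nn as nn

def orthonormal_init(model):
    # Step 1 of LSUV: give every conv / inner-product layer an orthonormal
    # weight matrix (nn.init.orthogonal_ flattens the trailing dims of conv
    # weights) and zero the bias.
    for m in model.modules():
        if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None: nn.init.zeros_(m.bias)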
In [2]:
#export
from exp.nb_08 import *

Data, Model and Runner

In [3]:
x_train, y_train, x_valid, y_valid = get_data()

train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)

nh,bs = 50,512
c = y_train.max().item()+1
loss_func = F.cross_entropy

data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)
In [4]:
len(data.train_ds)
Out[4]:
50000
In [5]:
mnist_view = view_tfm(1,28,28)
In [6]:
callbacks = [
    Recorder,
    partial(AvgStatsCallback, accuracy),
    partial(BatchTransformXCallback, mnist_view),
    CudaCallback
]
In [7]:
nfs = [8,16,32,64,64]

Let's refactor our conv_layer with two new properties that will make it easier to apply the LSUV algorithm: weight exposes the convolution's weights, and bias maps onto (the negative of) the GeneralRelu's sub, so shifting the bias shifts the layer's output after the activation:

In [8]:
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
        self.relu = GeneralRelu(sub=sub, **kwargs)
        
    def forward(self, x):
        return self.relu(self.conv(x))
    
    @property
    def bias(self): return -self.relu.sub
    
    @bias.setter
    def bias(self, v): self.relu.sub = -v
    
    @property
    def weight(self): return self.conv.weight
In [9]:
learn, run = get_learn_run(data, nfs, ConvLayer, 0.6, cbs=callbacks)

Here is our baseline without initializing the weights with LSUV:

In [10]:
run.fit(2, learn)
train: [1.93039625, tensor(0.3342, device='cuda:0')]
valid: [0.90568271484375, tensor(0.6834, device='cuda:0')]
train: [0.4223537890625, tensor(0.8611, device='cuda:0')]
valid: [0.218360595703125, tensor(0.9318, device='cuda:0')]

The LSUV process looks like this:

  • Initialize the network with the usual technique, then pass one batch through the model and look at the outputs of the linear and convolutional layers.
  • Rescale the weights according to the variance we actually observe on those activations, and subtract the observed mean from the bias. That way the activations stay normalized as they flow through the network (see the sketch after this list).
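A minimal sketch of that loop in plain PyTorch, assuming layer is a conv or linear module with a bias, and model and xb are the network and one batch (the notebook's lsuv_module below does the same thing with its own Hook class and adjusts the bias through the ConvLayer property instead):

def lsuv_layer(layer, model, xb, tol=1e-3, max_iters=10):
    stats = {}
    def capture(mod, inp, outp):
        # record the layer's output statistics on every forward pass
        stats['mean'], stats['std'] = outp.data.mean().item(), outp.data.std().item()
    handle = layer.register_forward_hook(capture)
    for _ in range(max_iters):      # shift the bias until the output mean is ~0
        model(xb)
        if abs(stats['mean']) < tol: break
        layer.bias.data -= stats['mean']
    for _ in range(max_iters):      # rescale the weights until the output std is ~1
        model(xb)
        if abs(stats['std'] - 1) < tol: break
        layer.weight.data /= stats['std']
    handle.remove()
    return stats['mean'], stats['std']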
In [11]:
learn, run = get_learn_run(data, nfs, ConvLayer, 0.4, cbs=callbacks)

Here is a helper function that grabs one batch from the dataloader and runs the callbacks' begin_batch handlers on it, so the batch gets the same treatment (CUDA transfer, reshaping) that it would during training.

In [12]:
#export 
def get_batch(dl, run):
    run.xb, run.yb = next(iter(dl))
    for cb in run.cbs: cb.set_runner(run)
    run('begin_batch')
    return run.xb, run.yb
In [13]:
xb, yb = get_batch(learn.data.train_dl, run)

We iterate through the model and pick out the layers we want to adjust: the convolutional and linear ones, not the ReLU or adaptive pooling layers.

In [14]:
#export
def find_mods(m, func):
    # Recursively collect every submodule of m for which func returns True.
    if func(m): return [m]
    return sum([find_mods(o, func) for o in m.children()], [])

def is_lin_layer(l):
    # The "linear" (conv / inner-product) layers whose stats LSUV adjusts.
    layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)
    return isinstance(l, layers)
In [15]:
mods = find_mods(learn.model, lambda o: isinstance(o, ConvLayer))
In [16]:
mods
Out[16]:
[ConvLayer(
   (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 )]
In [17]:
#export
def lsuv_append_stat(hook, mod, inp, outp):
    # Store the mean and std of the module's output on the hook at every forward pass.
    d = outp.data
    hook.mean, hook.std = d.mean().item(), d.std().item()
In [18]:
mdl = learn.model.cuda()
In [20]:
with Hooks(mods, lsuv_append_stat) as hooks:
    mdl(xb)
    for hook in hooks: print(hook.mean, hook.std)
0.07416391372680664 0.2242083102464676
0.0548112615942955 0.1576429158449173
0.05677511915564537 0.11721718311309814
0.07732177525758743 0.11247489601373672
0.042910635471343994 0.06787385791540146
In [23]:
#export
def lsuv_module(m, xb):
    h = Hook(m, lsuv_append_stat)
    
    # mdl(xb) runs a forward pass (updating h.mean and h.std via the hook);
    # the `is not None` trick lets us do that inside the while condition.
    # First shift the bias until the output mean is ~0, then rescale the
    # weights until the output std is ~1. Note this relies on the global mdl.
    while mdl(xb) is not None and abs(h.mean)   > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
        
    h.remove()
    return h.mean, h.std

The means don't stay exactly at zero, since we rescale the weights after adjusting the bias, but we get unit variance all the way through the network.

In [24]:
for m in mods: print(lsuv_module(m, xb))
(0.25661736726760864, 1.0)
(0.11878962069749832, 0.9999999403953552)
(0.15285749733448029, 0.9999998211860657)
(0.0946943610906601, 1.0)
(0.2778244614601135, 1.0)

And it works: validation accuracy improves from about 93% in our baseline to over 96%:

In [25]:
%time run.fit(2, learn)
train: [0.5241593359375, tensor(0.8312, device='cuda:0')]
valid: [0.15398057861328124, tensor(0.9529, device='cuda:0')]
train: [0.150899931640625, tensor(0.9526, device='cuda:0')]
valid: [0.1112196044921875, tensor(0.9657, device='cuda:0')]
Wall time: 2.55 s
In [26]:
nb_auto_export()