%reload_ext autoreload
%autoreload 2
%matplotlib inline
Basic training loop, Parameters, and Optim
We're going to start training!
#export
from exp.nb_02 import *
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
x_train,y_train,x_valid,y_valid = get_data()
x_train.shape
n, m = x_train.shape
c = y_train.max()+1;c
nh = 50
class Model(nn.Module):
def __init__(self, n_in, nh, n_out):
super().__init__()
self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]
def __call__(self, x):
for l in self.layers: x = l(x)
return x
model = Model(m, nh, 10)
pred = model(x_train)
Each grayscale image of a handwritten digit in our dataset has exactly one correct answer, an integer between 0 and 9; these are called the labels or targets.
Our y_train
is a tensor of integers that map to the x_train
images.
We can index into the y_train
to see the labels to images 0, 1, and 2.
y_train[:3]
We could also think of these labels as one-hot encoded vectors of length 10 where the label corresponds to a 1 at the label's index and everything else is a zero.
y0 = torch.zeros(10)
y0[y_train[0]] = 1
y0
The model's output is simply a length 10 vector for every example that is the result of numerous matrix multiplications:
pred[0]
What we would like is a probability distribution over each of our 10 classes: each class gets a probability, and the highest corresponds to the class the model has learned is the most "correct". The model must therefore learn and adjust its parameters by quantifying how wrong its guess was.
In order to accomplish this for our multi-class problem we'll use cross entropy loss.
The first step is to scale the outputs by putting them through a softmax function:
$$\hbox{softmax(x)}_{i} = \frac{e^{x_{i}}}{\sum_{0 \leq j \leq n-1} e^{x_{j}}}$$
This turns our length 10 output vector into a probability distribution.
For example, for the numerator, exponentiating the first two rows is just:
pred[:2].exp()
The denominator is trickier because we don't want to sum all of the rows together; we need to divide each exponentiated value by the sum of its own row.
Therefore, this won't work because it lumps everything together:
pred[:2].exp().sum()
Instead we sum along the last dimension and pass keepdim=True so each row keeps its own sum:
pred[:2].exp().sum(-1, keepdim=True)
Finally, here are the first two rows softmaxed:
soft_preds = pred[:2].exp() / pred[:2].exp().sum(-1, keepdim=True); soft_preds
And if we sum a row we get 1:
soft_preds[0].sum()
Here is our function; note that we also take the log, since we'll want log probabilities for the loss:
def log_softmax(x): return (x.exp() / (x.exp().sum(-1, keepdim=True))).log()
soft_preds1 = log_softmax(pred)
We can immediately refactor this by remembering that:
$$ \log{\frac{a}{b}} = \log{a} - \log{b} $$
Therefore:
$$ \displaystyle \log{\frac{e^{x_i}}{\sum_{j=0}^{n-1} e^{x_j}}} = \log{e^{x_i}} - \log{\sum_{j=0}^{n-1} e^{x_j}} = x_i - \log{\sum_{j=0}^{n-1} e^{x_j}} $$
In code this is:
def log_softmax(x): return x - x.exp().sum(-1,keepdim=True).log()
soft_preds = log_softmax(pred)
test_near(soft_preds, soft_preds1)
Now that we have an output vector of predictions, $\hat{y}$, in the form of a probability distribution over the possible classes of $y$ (0-9) we can use cross entropy loss to calculate just how far off our prediction is from the target value.
We're trying to find how dissimilar our prediction is to the target. So we are comparing the two distributions.
Assuming our $y$ is one-hot encoded, we calculate the cross entropy loss for a single example by taking the dot product of the two vectors:
$$ L = -y \cdot \log{\hat{y}} $$
-(y0 @ soft_preds[0])
But we can do this another way without having to one-hot encode our labels.
We'll use integer array indexing - we can pass a list of integers for each dimension and get back those specific rows/columns.
-soft_preds[[0],[y_train[0]]]
Now we turn that into a proper loss function which averages the negative log likelihoods over the entire output:
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss = nll(soft_preds, y_train); loss
There is also a more numerically stable way to compute the log of the sum of exponentials, called the LogSumExp trick. The idea is to use the following formula:
$$\log \left ( \sum_{j=1}^{n} e^{x_{j}} \right ) = \log \left ( e^{a} \sum_{j=1}^{n} e^{x_{j}-a} \right ) = a + \log \left ( \sum_{j=1}^{n} e^{x_{j}-a} \right )$$
where a is the maximum of the $x_{j}$.
def logsumexp(x):
    m = x.max(-1)[0] # grab the largest value in each row
    return m + (x-m[:,None]).exp().sum(-1).log() # subtract it out, then add it back at the end
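Using logsumexp we can write a numerically stable log softmax. Here is a quick sketch (log_softmax_stable is just an illustrative name, not from this notebook) showing that the naive version overflows on large activations while the stable one does not:

```python
import torch

def logsumexp(x):
    # largest value in each row
    m = x.max(-1)[0]
    # subtract it before exponentiating, add it back at the end
    return m + (x - m[:, None]).exp().sum(-1).log()

def log_softmax_stable(x):
    # log softmax expressed via logsumexp
    return x - logsumexp(x)[:, None]

# activations large enough that a naive exp() overflows to inf in float32
x = torch.randn(4, 10) + 1000
naive = x - x.exp().sum(-1, keepdim=True).log()  # exp overflows, result is -inf
stable = log_softmax_stable(x)                   # stays finite
```

Exponentiating the stable result still gives rows that sum to 1, as a softmax should.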
We've now built our loss function so let's go ahead and use the Pytorch version which combines log_softmax
and nll_loss
in a single function.
pyloss = F.cross_entropy(pred, y_train); pyloss
test_near(pyloss, loss)
It works!
We have all the parts now to create an algorithm!
The training loop combines everything we have done so far into an iterative process. We loop over the data again and again to fine-tune our model's parameters.
Here is what we need the training loop to do: grab a mini-batch of inputs and labels, compute the model's predictions, compute the loss, call backward to compute the gradients, update the parameters, and zero the gradients.
loss_func = F.cross_entropy
It would be nice if we had some sort of metric to follow to see how many of the training examples we are getting correct.
We'll start with accuracy.
torch.argmax(pred, dim=1)[:4]
#export
def accuracy(yh, y): return (torch.argmax(yh, dim=1)==y).float().mean()
Let's create a single mini-batch to test that it works:
bs=128 # batch size
xb = x_train[0:bs] # a mini-batch from x
preds = model(xb) # predictions
This will give us 128 predictions - each prediction here being a vector of length 10.
preds[0], preds.shape
Now we can put these predictions through our loss function with our labels and get some sort of measurement as to how far off they are:
yb = y_train[0:bs]
loss = loss_func(preds, yb); loss
accuracy(preds, yb)
About 10% accuracy, that's basically choosing randomly.
We have a long way to go but at least everything appears to be working.
Now let's look at our model's weights:
model.layers[0].weight
model.layers[0].weight.grad
At this stage the gradients with respect to the loss have not been computed.
Pytorch only computes them once .backward
is called. Let's do that:
loss.backward()
model.layers[0].weight.grad[0][:100]
Let's finally begin to train.
Three hyperparameters we need to set are the batch size, the learning rate, and the number of epochs (the number of times we iterate through the entire dataset).
bs = 64
lr = 0.5
epochs = 1
for epoch in range(epochs):
for i in range(0,n,bs):
xb = x_train[i:i+bs]
yb = y_train[i:i+bs]
loss = loss_func(model(xb), yb)
loss.backward()
with torch.no_grad():
for l in model.layers:
if hasattr(l, 'weight'):
l.weight -= lr * l.weight.grad
l.bias -= l.bias.grad * lr
l.weight.grad.zero_()
                    l.bias.grad.zero_()
loss_func(model(xb), yb)
And it's >90% accuracy. Not terrible for a simple neural net.
accuracy(model(x_train[:512]), y_train[:512])
We'll start by no longer treating our ReLU as a separate layer. Instead we'll use F.relu
which is the functional form that returns activations.
We then have two linear layers from nn; these are automatically registered by the nn.Module class as the parameters of the model. We can now call model.parameters() and it will return a generator that does essentially what we were doing manually: iterating through the list of layers and checking for the weight attribute.
nh = 64
class Model(nn.Module):
def __init__(self, n_in, n_out):
super().__init__()
self.l1 = nn.Linear(n_in, nh)
self.l2 = nn.Linear(nh, n_out)
def forward(self, x):
return self.l2(F.relu(self.l1(x)))
n,m
model = Model(m, 10)
Let's take a look inside our model.
We can do this by calling the .named_children
method on the model.
Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
for l in model.named_children(): print(f"{l[0]} : {l[1]}")
Pytorch's nn.Module
has __repr__
defined as the following:
def __repr__(self):
# We treat the extra repr like the sub-module, one item per line
extra_lines = []
extra_repr = self.extra_repr()
# empty string will be split into list ['']
if extra_repr:
extra_lines = extra_repr.split('\n')
child_lines = []
for key, module in self._modules.items():
mod_str = repr(module)
mod_str = _addindent(mod_str, 2)
child_lines.append('(' + key + '): ' + mod_str)
lines = extra_lines + child_lines
main_str = self._get_name() + '('
if lines:
# simple one-liner info, which most builtin Modules will use
if len(extra_lines) == 1 and not child_lines:
main_str += extra_lines[0]
else:
main_str += '\n ' + '\n '.join(lines) + '\n'
main_str += ')'
    return main_str
model
Let's create a function that we can call which will run our training loop.
This is standard for Machine Learning libraries like Sklearn.
def fit():
print("Training...")
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb = x_train[i:end]
yb = y_train[i:end]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for p in model.parameters():
p -= p.grad * lr
model.zero_grad()
fit()
loss_func(model(xb),yb)
accuracy(model(x_valid), y_valid)
Question: How does Pytorch know what attributes in __init__
to set as the model parameters?
In Python every time an attribute is assigned during a class initialization, __setattr__()
is called.
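A minimal illustration of this, using a throwaway Demo class:

```python
class Demo:
    """Hypothetical class showing that __setattr__ fires on every assignment."""
    def __init__(self):
        # even this assignment goes through __setattr__
        self.logged = []

    def __setattr__(self, name, value):
        # record each attribute name as it is set (skipping the log itself)
        if name != "logged":
            self.logged.append(name)
        super().__setattr__(name, value)

d = Demo()
d.weight = 42
d.bias = 0
```

After this runs, d.logged holds ["weight", "bias"]: every attribute assignment was intercepted. This is the hook Pytorch uses to register parameters and submodules.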
When we inherit from nn.Module
and then execute super().__init__()
Pytorch creates the following 'private' attributes:
self.training = True
self._parameters = OrderedDict()
self._buffers = OrderedDict()
self._backward_hooks = OrderedDict()
self._forward_hooks = OrderedDict()
self._forward_pre_hooks = OrderedDict()
self._state_dict_hooks = OrderedDict()
self._load_state_dict_pre_hooks = OrderedDict()
self._modules = OrderedDict()
When the model is instantiated from the Model class and self.l1 = nn.Linear is set as an attribute, __setattr__() is called.
Pytorch then does the following:
It checks that nn.Module's __init__ was called, then registers the attribute. From the Pytorch source:

class Parameter
'''A kind of Tensor that is to be considered a module parameter.
Parameters are ~torch.Tensor subclasses, that have a very special property when used with Module s - when they're assigned as Module attributes they are automatically added to the list of its parameters, and will appear e.g. in ~Module.parameters iterator. Assigning a Tensor doesn't have such effect. This is because one might want to cache some temporary state, like last hidden state of the RNN, in the model. If there was no such class as Parameter, these temporaries would get registered too.
Arguments: data (Tensor): parameter tensor. requires_grad (bool, optional): if the parameter requires gradient. See excluding-subgraphs for more details. Default: True'''
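We can see this registration behaviour directly. In this sketch (TinyModule is just an illustrative name), the Parameter is picked up by named_parameters() while the plain tensor is not:

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # throwaway class name for illustration
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3))  # a Parameter: gets registered
        self.cache = torch.zeros(3)            # a plain Tensor: not registered

m = TinyModule()
names = [n for n, _ in m.named_parameters()]  # only 'w' shows up
```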
class Module
'''Base class for all neural network modules.
Your models should also subclass this class.
Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes:'''
We can demonstrate this by building a dummy module that has a dictionary called _modules
class DummyModule():
def __init__(self, n_in, n_out):
self._modules = {}
self.l1 = nn.Linear(n_in, nh)
self.l2 = nn.Linear(nh, n_out)
def __setattr__(self, k,v):
if not k.startswith("_"): # register any keys that do not start with '_'
self._modules[k] = v # put it inside modules dict
super().__setattr__(k,v)
def __repr__(self): return f'{self._modules}'
def parameters(self):
for l in self._modules.values():
for p in l.parameters(): yield p
mdl = DummyModule(m, 10)
mdl
[o.shape for o in mdl.parameters()]
Now let's say we wanted to use the layers approach that we wrote earlier.
layers = [nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh, 10)]
class Model(nn.Module):
def __init__(self, layers):
super().__init__() # sets up the parameters module dict and other dicts
self.layers = layers
for i,l in enumerate(self.layers):
self.add_module(f'l{i}',l)
def forward(self, x):
for l in self.layers: x = l(x)
return x
model = Model(layers)
model
If we insist on using the layers technique we can use the built-in Pytorch nn.ModuleList:
class Model(nn.Module):
def __init__(self, layers):
super().__init__()
self.layers = nn.ModuleList(layers)
def forward(self, x):
for l in self.layers: x = l(x)
return x
model = Model(layers)
model
fit()
loss_func(model(xb), yb), accuracy(model(xb), yb)
Even easier would be to use the Pytorch nn.Sequential
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh, 10))
model
fit()
loss_func(model(xb), yb), accuracy(model(xb), yb)
Let's refactor our optimization step.
In our training loop we called backward on the loss to compute the gradients and then to make the actual updates to the weights we did the following:
with torch.no_grad():
for p in model.parameters(): p -= p.grad * lr
model.zero_grad()
We can simplify this bit of the loop if we instead put this away into an Optimizer
class which will then have two separate methods:
opt.step()
opt.zero()
class Optimizer():
def __init__(self, model, lr):
self.model = model
self.lr = lr
def step(self):
with torch.no_grad():
for p in self.model.parameters():
p -= p.grad * self.lr
def zero(self):
self.model.zero_grad()
opt = Optimizer(model, lr)
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb = x_train[i:end]
yb = y_train[i:end]
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero()
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc
Now let's see the Pytorch version.
optim.SGD.step iterates through each param_group, and then through each group's params list, which contains the parameters in that group:
opt = optim.SGD(model.parameters(), lr=lr)
opt.param_groups
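Conceptually, the plain-SGD update over param_groups looks something like this sketch (sgd_step is an illustrative stand-in for what optim.SGD.step does, ignoring momentum, weight decay, and the other options the real version handles):

```python
import torch

def sgd_step(param_groups):
    # simplified sketch of the plain-SGD case of optim.SGD.step
    with torch.no_grad():
        for group in param_groups:        # one dict of hyperparameters per group
            for p in group["params"]:     # the parameters in this group
                if p.grad is not None:
                    p -= group["lr"] * p.grad

w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()                   # dloss/dw = 2 everywhere
sgd_step([{"params": [w], "lr": 0.5}])     # w: 1 - 0.5*2 = 0
```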
Let's make a quick function to get a model and an optimizer:
#export
def get_model():
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))
return model, optim.SGD(model.parameters(), lr=lr)
model, opt = get_model()
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb=x_train[i:end]
yb=y_train[i:end]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
Another part of our training loop we can improve is:
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb=x_train[i:end]
yb=y_train[i:end]
Let's build a Dataset class that will hold the x's and y's in one object, so that we can grab both with a single indexing expression:
xb, yb = train_ds[i:end]
#export
class Dataset():
def __init__(self, x, y):
self.x, self.y = x,y
def __len__(self):
return len(self.x)
def __getitem__(self,key):
return self.x[key], self.y[key]
train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)
assert len(train_ds) == len(x_train)
xb,yb = train_ds[0:5]
assert xb.shape==(5,28*28)
assert yb.shape==(5,)
xb,yb
model, opt = get_model()
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb, yb = train_ds[i:end]
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
And we can use the same refactoring logic to let us simply pull x and y batches out of a DataLoader class that holds the Dataset we just made:
for xb, yb in train_dl:
preds = model(xb)
loss = loss_func(preds, yb)
class DataLoader():
def __init__(self, ds, bs):
self.ds = ds
self.bs = bs
def __len__(self):
        return (len(self.ds) + self.bs - 1) // self.bs # number of batches, counting a partial final batch
def __iter__(self):
for i in range(0, len(self.ds), self.bs):
yield self.ds[i:i+self.bs]
train_dl = DataLoader(train_ds, 64)
valid_dl = DataLoader(valid_ds, 128)
xb, yb = next(iter(valid_dl))
assert xb.shape == (128, 28*28)
plt.imshow(xb[0].view(28, 28))
yb[0]
model, opt = get_model()
def fit():
for epoch in range(epochs):
for xb, yb in train_dl:
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
For certain datasets where the dependent variable is in a specific order we would want to shuffle the data before loading it into batches and putting it through the model.
Every epoch we'll iterate through the entire dataset randomly. The model will get to see each training example once but the order will be different each time.
To do this we need our DataLoader to draw batches of random indices that cover the range of the dataset.
We'll need permutations:
torch.randperm(10)
class DataLoader():
def __init__(self, ds, bs, shuffle=False):
self.ds = ds
self.bs = bs
self.shuffle = shuffle
def __iter__(self):
if self.shuffle:
perms = torch.randperm(len(self.ds))
            for i in range(0, len(perms), self.bs): yield self.ds[perms[i:i+self.bs]]
else:
for i in range(0, len(self.ds), self.bs): yield self.ds[i:i+self.bs]
test_dl = DataLoader(train_ds, 128, shuffle=True)
test_valid_dl = DataLoader(valid_ds, 128, shuffle=False)
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
model, opt = get_model()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
class Sampler():
def __init__(self, ds, bs, shuffle=False):
self.n = len(ds) # just the length of the dataset not the whole dataset
self.bs = bs
self.shuffle = shuffle
def __iter__(self):
self.idxs = torch.randperm(self.n) if self.shuffle else torch.arange(self.n)
for i in range(0, self.n, self.bs): yield self.idxs[i:i+self.bs]
small_ds = Dataset(*train_ds[:10])
To test this out, let's first see what happens when shuffle=False:
s = Sampler(small_ds, 3, False)
[x for x in s]
s = Sampler(small_ds, 3, True)
[x for x in s]
def collate(b):
xs, ys = zip(*b)
return torch.stack(xs), torch.stack(ys)
class DataLoader():
def __init__(self, ds, sampler, collate_fn=collate):
self.ds = ds
self.sampler = sampler
self.collate_fn = collate_fn
def __iter__(self):
for s in self.sampler: yield self.collate_fn([self.ds[i] for i in s])
train_samp = Sampler(train_ds, bs, shuffle=True)
valid_samp = Sampler(valid_ds, bs, shuffle=False)
train_dl = DataLoader(train_ds, sampler=train_samp, collate_fn=collate)
valid_dl = DataLoader(valid_ds, sampler=valid_samp, collate_fn=collate)
xb, yb = next(iter(train_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(train_dl))
plt.imshow(xb[0].view(28,28))
Now we'll use the Pytorch version.
It has a couple of particularly interesting args:
drop_last: drop the last incomplete batch
num_workers: how many subprocesses to use for data loading; 0 means that the data will be loaded in the main process.
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
train_dl = DataLoader(train_ds, bs, shuffle=True)
valid_dl = DataLoader(valid_ds, 128, shuffle=False)
model, opt = get_model()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
A validation set is a key component of training properly: it is the only real indication we have that the model is learning something useful.
Specifically, it signals to us whether or not the model is overfitting to the training data.
If we were to simply watch the training accuracy we would most likely see a continual improvement, as the loss diminishes and the accuracy increases to nearly 100%.
That only indicates the performance of the model on labelled data. If the model's learning algorithm is powerful and it has enough parameters it can memorize the training data.
But then, during inference time, when it is used to make predictions on unseen data, i.e. generalize, it may do horribly.
So let's build a more complete training loop that includes a validation error:
(Note: model.train()
and model.eval()
are used to turn on and off certain types of layers like Dropout and BatchNorm.)
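For example, nn.Dropout is active in training mode and becomes the identity in eval mode (a quick sketch):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                 # training mode: dropout is active
train_out = drop(x)          # ~half the values zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()                  # eval mode: dropout becomes the identity
eval_out = drop(x)
```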
def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
for epoch in range(epochs):
model.train()
for xb, yb in train_dl:
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
model.eval()
vloss=[]
acc=[]
with torch.no_grad():
for xb,yb in valid_dl:
preds = model(xb)
vloss.append(loss_func(preds, yb))
acc.append(accuracy(preds, yb))
print('loss:', sum([i.item() for i in vloss])/len(valid_dl))
print('accuracy:', sum([i.item() for i in acc])/len(valid_dl))
model, opt = get_model()
fit(2, model, loss_func, opt, train_dl, valid_dl)
#export
def get_dls(train_ds, valid_ds, bs, **kwargs):
return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),
DataLoader(valid_ds, batch_size=bs*2, **kwargs))
Creating our dataloaders, getting a model and optimizer, and training can be run in three lines of code:
train_dl, valid_dl = get_dls(train_ds, valid_ds, 128)
model, opt = get_model()
fit(4, model, loss_func, opt, train_dl, valid_dl)
!python notebook2script.py 03_minibatch_training.ipynb