%reload_ext autoreload
%autoreload 2
%matplotlib inline
Basic training loop, Parameters, and Optim
We're going to start training!
#export
from exp.nb_02 import *
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
x_train,y_train,x_valid,y_valid = get_data()
x_train.shape
n, m = x_train.shape
c = y_train.max()+1;c
nh = 50
class Model(nn.Module):
def __init__(self, n_in, nh, n_out):
super().__init__()
self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]
def __call__(self, x):
for l in self.layers: x = l(x)
return x
model = Model(m, nh, 10)
pred = model(x_train)
Each grayscale image of a handwritten digit in our dataset has exactly one correct answer, an integer between 0 and 9; these are called the labels or targets.
Our y_train
is a tensor of integers that map to the x_train
images.
We can index into the y_train
to see the labels to images 0, 1, and 2.
y_train[:3]
We could also think of these labels as one-hot encoded vectors of length 10 where the label corresponds to a 1 at the label's index and everything else is a zero.
y0 = torch.zeros(10)
y0[y_train[0]] = 1
y0
The model's output is simply a length 10 vector for every example that is the result of numerous matrix multiplications:
pred[0]
What we would like is a probability distribution over each of our 10 classes: each class gets a probability, and the highest corresponds to the class the model has learned is the most "correct". The model must therefore learn and adjust its parameters by quantifying how wrong its guess was.
In order to accomplish this for our multi-class problem we'll use cross entropy loss.
The first step is to scale the outputs by putting them through a softmax function:
$$\hbox{softmax(x)}_{i} = \frac{e^{x_{i}}}{\sum_{0 \leq j \leq n-1} e^{x_{j}}}$$
This turns our length 10 output vector into a probability distribution.
For example, for the numerator, exponentiating the first two rows is just:
pred[:2].exp()
The denominator is trickier because we don't want to sum all of the rows together; we need to divide each exponentiated value by the sum of its own row.
Therefore, this won't work because it lumps everything together:
pred[:2].exp().sum()
Instead we sum along the last dimension and pass keepdim=True so each row keeps its own sum:
pred[:2].exp().sum(-1, keepdim=True)
Finally, here are the first two rows softmaxed:
soft_preds = pred[:2].exp() / pred[:2].exp().sum(-1, keepdim=True); soft_preds
And if we sum a row we get 1:
soft_preds[0].sum()
Here is our function; note that we also take the log, since we'll want log probabilities for the loss:
def log_softmax(x): return (x.exp() / (x.exp().sum(-1, keepdim=True))).log()
soft_preds1 = log_softmax(pred)
We can immediately refactor this by remembering that:
$$ \log{\frac{a}{b}} = \log{a} - \log{b} $$
Therefore:
$$ \displaystyle \log{\frac{e^{x_i}}{\sum_{j=0}^{n-1} e^{x_j}}} = \log{e^{x_i}} - \log{\sum_{j=0}^{n-1} e^{x_j}} = x_i - \log{\sum_{j=0}^{n-1} e^{x_j}} $$
In code this is:
def log_softmax(x): return x - x.exp().sum(-1,keepdim=True).log()
soft_preds = log_softmax(pred)
test_near(soft_preds, soft_preds1)
Now that we have an output vector of predictions, $\hat{y}$, in the form of a probability distribution over the possible classes of $y$ (0-9) we can use cross entropy loss to calculate just how far off our prediction is from the target value.
We're trying to find how dissimilar our prediction is to the target. So we are comparing the two distributions.
Assuming our $y$ is one-hot encoded, we calculate the cross entropy loss for a single example by taking the dot product of the two vectors:
$$ L = -y \cdot \log{\hat{y}} $$
-(y0 @ soft_preds[0])
But we can do this another way without having to one-hot encode our labels.
We'll use integer array indexing - we can pass a list of integers for each dimension and get back those specific rows/columns.
-soft_preds[[0],[y_train[0]]]
Now we turn that into a proper loss function which averages the negative log likelihoods over the entire output:
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss = nll(soft_preds, y_train); loss
There is also a more numerically stable way to compute the log of the sum of exponentials, called the LogSumExp trick. The idea is to use the following formula:
$$\log \left ( \sum_{j=1}^{n} e^{x_{j}} \right ) = \log \left ( e^{a} \sum_{j=1}^{n} e^{x_{j}-a} \right ) = a + \log \left ( \sum_{j=1}^{n} e^{x_{j}-a} \right )$$
where a is the maximum of the $x_{j}$.
def logsumexp(x):
    m = x.max(-1)[0] # grab the largest value in each row
    return m + (x-m[:,None]).exp().sum(-1).log() # subtract it out, then add it back at the end
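Using logsumexp we can write a numerically stable log softmax. Here is a quick sketch (log_softmax_stable is just an illustrative name, not from this notebook) showing that the naive version overflows on large activations while the stable one does not:

```python
import torch

def logsumexp(x):
    # largest value in each row
    m = x.max(-1)[0]
    # subtract it before exponentiating, add it back at the end
    return m + (x - m[:, None]).exp().sum(-1).log()

def log_softmax_stable(x):
    # log softmax expressed via logsumexp
    return x - logsumexp(x)[:, None]

# activations large enough that a naive exp() overflows to inf in float32
x = torch.randn(4, 10) + 1000
naive = x - x.exp().sum(-1, keepdim=True).log()  # exp overflows, result is -inf
stable = log_softmax_stable(x)                   # stays finite
```

Exponentiating the stable result still gives rows that sum to 1, as a softmax should.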
We've now built our loss function so let's go ahead and use the Pytorch version which combines log_softmax
and nll_loss
in a single function.
pyloss = F.cross_entropy(pred, y_train); pyloss
test_near(pyloss, loss)
It works!
We have all the parts now to create an algorithm!
The training loop combines everything we have done so far into an iterative process. We loop over the data again and again to fine-tune our model's parameters.
Here is what we need the training loop to do: grab a mini-batch of inputs and labels, compute the model's predictions, compute the loss, call backward to compute the gradients, update the parameters, and zero the gradients.
loss_func = F.cross_entropy
It would be nice if we had some sort of metric to follow to see how many of the training examples we are getting correct.
We'll start with accuracy.
torch.argmax(pred, dim=1)[:4]
#export
def accuracy(yh, y): return (torch.argmax(yh, dim=1)==y).float().mean()
Let's create a single mini-batch to test that it works:
bs=128 # batch size
xb = x_train[0:bs] # a mini-batch from x
preds = model(xb) # predictions
This will give us 128 predictions - each prediction here being a vector of length 10.
preds[0], preds.shape
Now we can put these predictions through our loss function with our labels and get some sort of measurement as to how far off they are:
yb = y_train[0:bs]
loss = loss_func(preds, yb); loss
accuracy(preds, yb)
About 10% accuracy, that's basically choosing randomly.
We have a long way to go but at least everything appears to be working.
Now let's look at our model's weights:
model.layers[0].weight
model.layers[0].weight.grad
At this stage the gradients with respect to the loss have not been computed.
Pytorch only computes them once .backward
is called. Let's do that:
loss.backward()
model.layers[0].weight.grad[0][:100]
Let's finally begin to train.
Three hyperparameters we need to set are the batch size, the learning rate, and the number of epochs (the number of times we iterate through the entire dataset).
bs = 64
lr = 0.5
epochs = 1
for epoch in range(epochs):
for i in range(0,n,bs):
xb = x_train[i:i+bs]
yb = y_train[i:i+bs]
loss = loss_func(model(xb), yb)
loss.backward()
with torch.no_grad():
for l in model.layers:
if hasattr(l, 'weight'):
l.weight -= lr * l.weight.grad
l.bias -= l.bias.grad * lr
l.weight.grad.zero_()
                    l.bias.grad.zero_()
loss_func(model(xb), yb)
And it's >90% accuracy. Not terrible for a simple neural net.
accuracy(model(x_train[:512]), y_train[:512])
We'll start by no longer treating our ReLU as a separate layer. Instead we'll use F.relu
which is the functional form that returns activations.
We then have two linear layers from nn; these are automatically registered by the nn.Module class as the parameters of the model. We can now call model.parameters() and it will return a generator that does essentially what we were doing manually: iterating through the list of layers and checking for the weight attribute.
nh = 64
class Model(nn.Module):
def __init__(self, n_in, n_out):
super().__init__()
self.l1 = nn.Linear(n_in, nh)
self.l2 = nn.Linear(nh, n_out)
def forward(self, x):
return self.l2(F.relu(self.l1(x)))
n,m
model = Model(m, 10)
Let's take a look inside our model.
We can do this by calling the .named_children
method on the model.
Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
for l in model.named_children(): print(f"{l[0]} : {l[1]}")
Pytorch's nn.Module
has __repr__
defined as the following:
def __repr__(self):
# We treat the extra repr like the sub-module, one item per line
extra_lines = []
extra_repr = self.extra_repr()
# empty string will be split into list ['']
if extra_repr:
extra_lines = extra_repr.split('\n')
child_lines = []
for key, module in self._modules.items():
mod_str = repr(module)
mod_str = _addindent(mod_str, 2)
child_lines.append('(' + key + '): ' + mod_str)
lines = extra_lines + child_lines
main_str = self._get_name() + '('
if lines:
# simple one-liner info, which most builtin Modules will use
if len(extra_lines) == 1 and not child_lines:
main_str += extra_lines[0]
else:
main_str += '\n ' + '\n '.join(lines) + '\n'
main_str += ')'
    return main_str
model
Let's create a function that we can call which will run our training loop.
This is standard for Machine Learning libraries like Sklearn.
def fit():
print("Training...")
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb = x_train[i:end]
yb = y_train[i:end]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for p in model.parameters():
p -= p.grad * lr
model.zero_grad()
fit()
loss_func(model(xb),yb)
accuracy(model(x_valid), y_valid)
Question: How does Pytorch know what attributes in __init__
to set as the model parameters?
In Python every time an attribute is assigned during a class initialization, __setattr__()
is called.
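A minimal illustration of this, using a throwaway Demo class:

```python
class Demo:
    """Hypothetical class showing that __setattr__ fires on every assignment."""
    def __init__(self):
        # even this assignment goes through __setattr__
        self.logged = []

    def __setattr__(self, name, value):
        # record each attribute name as it is set (skipping the log itself)
        if name != "logged":
            self.logged.append(name)
        super().__setattr__(name, value)

d = Demo()
d.weight = 42
d.bias = 0
```

After this runs, d.logged holds ["weight", "bias"]: every attribute assignment was intercepted. This is the hook Pytorch uses to register parameters and submodules.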
When we inherit from nn.Module
and then execute super().__init__()
Pytorch creates the following 'private' attributes:
self.training = True
self._parameters = OrderedDict()
self._buffers = OrderedDict()
self._backward_hooks = OrderedDict()
self._forward_hooks = OrderedDict()
self._forward_pre_hooks = OrderedDict()
self._state_dict_hooks = OrderedDict()
self._load_state_dict_pre_hooks = OrderedDict()
self._modules = OrderedDict()
When the model is instantiated from the Model class and self.l1 = nn.Linear is set as an attribute, __setattr__() is called.
Pytorch then does the following:
It checks that nn.Module's __init__ was called, then registers the attribute. From the Pytorch source:

class Parameter
'''A kind of Tensor that is to be considered a module parameter.
Parameters are ~torch.Tensor subclasses, that have a very special property when used with Module s - when they're assigned as Module attributes they are automatically added to the list of its parameters, and will appear e.g. in ~Module.parameters iterator. Assigning a Tensor doesn't have such effect. This is because one might want to cache some temporary state, like last hidden state of the RNN, in the model. If there was no such class as Parameter, these temporaries would get registered too.
Arguments: data (Tensor): parameter tensor. requires_grad (bool, optional): if the parameter requires gradient. See excluding-subgraphs for more details. Default: True'''
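We can see this registration behaviour directly. In this sketch (TinyModule is just an illustrative name), the Parameter is picked up by named_parameters() while the plain tensor is not:

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # throwaway class name for illustration
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3))  # a Parameter: gets registered
        self.cache = torch.zeros(3)            # a plain Tensor: not registered

m = TinyModule()
names = [n for n, _ in m.named_parameters()]  # only 'w' shows up
```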
class Module
'''Base class for all neural network modules.
Your models should also subclass this class.
Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes:'''
We can demonstrate this by building a dummy module that has a dictionary called _modules
class DummyModule():
def __init__(self, n_in, n_out):
self._modules = {}
self.l1 = nn.Linear(n_in, nh)
self.l2 = nn.Linear(nh, n_out)
def __setattr__(self, k,v):
if not k.startswith("_"): # register any keys that do not start with '_'
self._modules[k] = v # put it inside modules dict
super().__setattr__(k,v)
def __repr__(self): return f'{self._modules}'
def parameters(self):
for l in self._modules.values():
for p in l.parameters(): yield p
mdl = DummyModule(m, 10)
mdl
[o.shape for o in mdl.parameters()]
Now let's say we wanted to use the layers approach that we wrote earlier.
layers = [nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh, 10)]
class Model(nn.Module):
def __init__(self, layers):
super().__init__() # sets up the parameters module dict and other dicts
self.layers = layers
for i,l in enumerate(self.layers):
self.add_module(f'l{i}',l)
def forward(self, x):
for l in self.layers: x = l(x)
return x
model = Model(layers)
model
If we insist on using the layers technique we can use the built-in Pytorch nn.ModuleList:
class Model(nn.Module):
def __init__(self, layers):
super().__init__()
self.layers = nn.ModuleList(layers)
def forward(self, x):
for l in self.layers: x = l(x)
return x
model = Model(layers)
model
fit()
loss_func(model(xb), yb), accuracy(model(xb), yb)
Even easier would be to use the Pytorch nn.Sequential
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh, 10))
model
fit()
loss_func(model(xb), yb), accuracy(model(xb), yb)
Let's refactor our optimization step.
In our training loop we called backward on the loss to compute the gradients and then to make the actual updates to the weights we did the following:
with torch.no_grad():
for p in model.parameters(): p -= p.grad * lr
model.zero_grad()
We can simplify this bit of the loop if we instead put this away into an Optimizer
class which will then have two separate methods:
opt.step()
opt.zero()
class Optimizer():
def __init__(self, model, lr):
self.model = model
self.lr = lr
def step(self):
with torch.no_grad():
for p in self.model.parameters():
p -= p.grad * self.lr
def zero(self):
self.model.zero_grad()
opt = Optimizer(model, lr)
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb = x_train[i:end]
yb = y_train[i:end]
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero()
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc
Now let's see the Pytorch version.
optim.SGD.step iterates through each param_group, and then through each group's params list, which contains the parameters in that group:
opt = optim.SGD(model.parameters(), lr=lr)
opt.param_groups
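Conceptually, the plain-SGD update over param_groups looks something like this sketch (sgd_step is an illustrative stand-in for what optim.SGD.step does, ignoring momentum, weight decay, and the other options the real version handles):

```python
import torch

def sgd_step(param_groups):
    # simplified sketch of the plain-SGD case of optim.SGD.step
    with torch.no_grad():
        for group in param_groups:        # one dict of hyperparameters per group
            for p in group["params"]:     # the parameters in this group
                if p.grad is not None:
                    p -= group["lr"] * p.grad

w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()                   # dloss/dw = 2 everywhere
sgd_step([{"params": [w], "lr": 0.5}])     # w: 1 - 0.5*2 = 0
```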
Let's make a quick function to get a model and an optimizer:
#export
def get_model():
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))
return model, optim.SGD(model.parameters(), lr=lr)
model, opt = get_model()
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb=x_train[i:end]
yb=y_train[i:end]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
Another part of our training loop we can improve is:
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb=x_train[i:end]
yb=y_train[i:end]
Let's build a Dataset class that will hold the x's and y's in one object, so that we can grab both with a single indexing expression:
xb, yb = train_ds[i:end]
#export
class Dataset():
def __init__(self, x, y):
self.x, self.y = x,y
def __len__(self):
return len(self.x)
def __getitem__(self,key):
return self.x[key], self.y[key]
train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)
assert len(train_ds) == len(x_train)
xb,yb = train_ds[0:5]
assert xb.shape==(5,28*28)
assert yb.shape==(5,)
xb,yb
model, opt = get_model()
for epoch in range(epochs):
for i in range(0,n,bs):
end = i+bs if i+bs < n else n
xb, yb = train_ds[i:end]
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
And we can use the same refactoring logic to let us simply pull x and y batches out of a DataLoader class that holds the Dataset we just made:
for xb, yb in train_dl:
preds = model(xb)
loss = loss_func(preds, yb)
class DataLoader():
def __init__(self, ds, bs):
self.ds = ds
self.bs = bs
def __len__(self):
        return (len(self.ds) + self.bs - 1) // self.bs # number of batches, counting a partial final batch
def __iter__(self):
for i in range(0, len(self.ds), self.bs):
yield self.ds[i:i+self.bs]
train_dl = DataLoader(train_ds, 64)
valid_dl = DataLoader(valid_ds, 128)
xb, yb = next(iter(valid_dl))
assert xb.shape == (128, 28*28)
plt.imshow(xb[0].view(28, 28))
yb[0]
model, opt = get_model()
def fit():
for epoch in range(epochs):
for xb, yb in train_dl:
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
For certain datasets where the dependent variable is in a specific order we would want to shuffle the data before loading it into batches and putting it through the model.
Every epoch we'll iterate through the entire dataset randomly. The model will get to see each training example once but the order will be different each time.
To do this we need our DataLoader to draw batches of random indices that cover the range of the dataset.
We'll need permutations:
torch.randperm(10)
class DataLoader():
def __init__(self, ds, bs, shuffle=False):
self.ds = ds
self.bs = bs
self.shuffle = shuffle
def __iter__(self):
if self.shuffle:
perms = torch.randperm(len(self.ds))
            for i in range(0, len(perms), self.bs): yield self.ds[perms[i:i+self.bs]]
else:
for i in range(0, len(self.ds), self.bs): yield self.ds[i:i+self.bs]
test_dl = DataLoader(train_ds, 128, shuffle=True)
test_valid_dl = DataLoader(valid_ds, 128, shuffle=False)
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(test_dl))
plt.imshow(xb[0].view(28,28))
model, opt = get_model()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
class Sampler():
def __init__(self, ds, bs, shuffle=False):
self.n = len(ds) # just the length of the dataset not the whole dataset
self.bs = bs
self.shuffle = shuffle
def __iter__(self):
self.idxs = torch.randperm(self.n) if self.shuffle else torch.arange(self.n)
for i in range(0, self.n, self.bs): yield self.idxs[i:i+self.bs]
small_ds = Dataset(*train_ds[:10])
To test this out, let's first see what happens when shuffle=False:
s = Sampler(small_ds, 3, False)
[x for x in s]
s = Sampler(small_ds, 3, True)
[x for x in s]
def collate(b):
xs, ys = zip(*b)
return torch.stack(xs), torch.stack(ys)
class DataLoader():
def __init__(self, ds, sampler, collate_fn=collate):
self.ds = ds
self.sampler = sampler
self.collate_fn = collate_fn
def __iter__(self):
for s in self.sampler: yield self.collate_fn([self.ds[i] for i in s])
train_samp = Sampler(train_ds, bs, shuffle=True)
valid_samp = Sampler(valid_ds, bs, shuffle=False)
train_dl = DataLoader(train_ds, sampler=train_samp, collate_fn=collate)
valid_dl = DataLoader(valid_ds, sampler=valid_samp, collate_fn=collate)
xb, yb = next(iter(train_dl))
plt.imshow(xb[0].view(28,28))
xb, yb = next(iter(train_dl))
plt.imshow(xb[0].view(28,28))
Now we'll use the Pytorch version.
It has a couple of particularly interesting args:
drop_last: drop the last incomplete batch
num_workers: how many subprocesses to use for data loading; 0 means that the data will be loaded in the main process.
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
train_dl = DataLoader(train_ds, bs, shuffle=True)
valid_dl = DataLoader(valid_ds, 128, shuffle=False)
model, opt = get_model()
fit()
vpreds = model(x_valid)
loss,acc = loss_func(vpreds, y_valid), accuracy(vpreds, y_valid)
loss,acc
A validation set is a key component of training properly: it is the only real indication we have that the model is learning something useful.
Specifically, it signals to us whether or not the model is overfitting to the training data.
If we were to simply watch the training accuracy we would most likely see a continual improvement, as the loss diminishes and the accuracy increases to nearly 100%.
That only indicates the performance of the model on labelled data. If the model's learning algorithm is powerful and it has enough parameters it can memorize the training data.
But then, during inference time, when it is used to make predictions on unseen data, i.e. generalize, it may do horribly.
So let's build a more complete training loop that includes a validation error:
(Note: model.train()
and model.eval()
are used to turn on and off certain types of layers like Dropout and BatchNorm.)
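For example, nn.Dropout is active in training mode and becomes the identity in eval mode (a quick sketch):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                 # training mode: dropout is active
train_out = drop(x)          # ~half the values zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()                  # eval mode: dropout becomes the identity
eval_out = drop(x)
```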
def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
for epoch in range(epochs):
model.train()
for xb, yb in train_dl:
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
model.eval()
vloss=[]
acc=[]
with torch.no_grad():
for xb,yb in valid_dl:
preds = model(xb)
vloss.append(loss_func(preds, yb))
acc.append(accuracy(preds, yb))
print('loss:', sum([i.item() for i in vloss])/len(valid_dl))
print('accuracy:', sum([i.item() for i in acc])/len(valid_dl))
model, opt = get_model()
fit(2, model, loss_func, opt, train_dl, valid_dl)
#export
def get_dls(train_ds, valid_ds, bs, **kwargs):
return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),
DataLoader(valid_ds, batch_size=bs*2, **kwargs))
Creating our dataloaders, getting a model and optimizer, and training can be run in three lines of code:
train_dl, valid_dl = get_dls(train_ds, valid_ds, 128)
model, opt = get_model()
fit(4, model, loss_func, opt, train_dl, valid_dl)
!python notebook2script.py 03_minibatch_training.ipynb