from torch import tensor, nn
import pickle, gzip, torch, matplotlib as mpl
from pathlib import Path
import matplotlib.pyplot as plt
Convolutional Neural Networks
Let’s learn about convolutional neural networks (CNNs). We mostly use CNNs, rather than linear layers, on images. Compared to linear layers, CNNs reach higher accuracy faster because they have fewer parameters to train and can capture local patterns more easily.
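To make the parameter claim concrete, here is a quick count (a rough sketch; the layer sizes are arbitrary and chosen only for comparison):

linear = nn.Linear(28*28, 100)           # 784*100 weights + 100 biases = 78,500 parameters
conv = nn.Conv2d(1, 100, kernel_size=3)  # 100 3x3 kernels: 900 weights + 100 biases = 1,000 parameters
sum(p.numel() for p in linear.parameters()), sum(p.numel() for p in conv.parameters())
(78500, 1000)
And unlike the linear layer, the conv layer’s parameter count stays the same no matter how big the image gets.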
First, we set up the environment. For data, we will use the MNIST digits.
MNIST_URL = 'https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/data/mnist.pkl.gz?raw=true'
path_data = Path('data')
path_data.mkdir(exist_ok=True)
path_gz = path_data/'mnist.pkl.gz'
from urllib.request import urlretrieve
if not path_gz.exists(): urlretrieve(MNIST_URL, path_gz)
torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
torch.manual_seed(1)
mpl.rcParams['image.cmap'] = 'gray'
path_data = Path('data')
path_gz = path_data/'mnist.pkl.gz'
with gzip.open(path_gz, 'rb') as f: ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train, y_train, x_valid, y_valid = map(tensor, [x_train, y_train, x_valid, y_valid])
Conv1d
Let’s go over how PyTorch’s convolutional layers work. We will start with a simple example, Conv1d. Let’s say we have a vector with shape (1, 2). We can think of this vector as a sentence with two words, like “I sit,” where each word has been converted into a number so that the computer can understand it.
torch.manual_seed(1)
sent1 = torch.randn((1, 2))
sent1
tensor([[0.66, 0.27]])
Here is a Conv1d layer with 1 input channel, 1 output channel, and a kernel size of 2. The weight of this layer holds the kernels; its shape corresponds to (output, input, kernel_size). In this simple case, we have one kernel of size 2. The shape of the bias depends on the number of outputs of the layer.
layer1 = nn.Conv1d(1, 1, kernel_size=2)
layer1
Conv1d(1, 1, kernel_size=(2,), stride=(1,))
layer1.weight, layer1.weight.shape
(Parameter containing:
tensor([[[-0.67, 0.42]]], requires_grad=True),
torch.Size([1, 1, 2]))
layer1.bias
Parameter containing:
tensor([-0.15], requires_grad=True)
When we apply this layer to our sentence, we get -0.47, which is the same as taking the dot product of the weight and the sentence, then adding the bias.
layer1(sent1)
tensor([[-0.47]], grad_fn=<SqueezeBackward1>)
(layer1.weight * sent1).sum() + layer1.bias
tensor([-0.47], grad_fn=<AddBackward0>)
When we set the output to 2, that means there are two kernels: one for each output. We generally refer to an output as a channel, so when we increase the output, we say we increase the number of channels. We will talk about why this is useful later.
layer2 = nn.Conv1d(1, 2, kernel_size=2)
layer2
Conv1d(1, 2, kernel_size=(2,), stride=(1,))
layer2.weight, layer2.weight.shape
(Parameter containing:
tensor([[[ 0.36, 0.10]],
[[-0.09, 0.20]]], requires_grad=True),
torch.Size([2, 1, 2]))
layer2.bias
Parameter containing:
tensor([0.03, 0.26], requires_grad=True)
layer2(sent1)
tensor([[0.30],
[0.25]], grad_fn=<SqueezeBackward1>)
(layer2.weight[0] * sent1).sum() + layer2.bias[0]
tensor(0.30, grad_fn=<AddBackward0>)
(layer2.weight[1] * sent1).sum() + layer2.bias[1]
tensor(0.25, grad_fn=<AddBackward0>)
That was simple. What if the sentence is longer than 2 words? Let’s say we have 4 words this time. When we apply layer1 to it, we get 3 numbers.
sent2 = torch.randn((1, 4))
sent2
tensor([[-1.33, 0.52, 0.75, -0.08]])
layer1(sent2)
tensor([[ 0.96, -0.17, -0.67]], grad_fn=<SqueezeBackward1>)
We can imagine our kernel as a sliding window moving from left to right. How far it moves at each step is the stride. By default it is 1, so the window moves one position at a time. This is how the output is calculated:
for words in [sent2[0][i:i+2] for i in range(len(sent2[0]) - 1)]:
    print((layer1.weight * words).sum() + layer1.bias)
tensor([0.96], grad_fn=<AddBackward0>)
tensor([-0.17], grad_fn=<AddBackward0>)
tensor([-0.67], grad_fn=<AddBackward0>)
Stride and Padding
So that’s basically how convolution works. By setting different values for stride, we change how far the kernel moves at each step. By setting stride to 2, we skip every other position.
layer3 = nn.Conv1d(1, 1, kernel_size=2, stride=2)
layer3
Conv1d(1, 1, kernel_size=(2,), stride=(2,))
layer3(sent2)
tensor([[ 0.20, -0.61]], grad_fn=<SqueezeBackward1>)
Because the kernel moved two positions at each step, we now get 2 values as output instead of 3. This way, we reduce the number of pixels in the input, which means we store the information more compactly, like zipping a file. However, we might lose information if we shrink too much. Therefore, when we use stride, we increase the number of channels to balance it out.
For instance, if we use stride 2, we halve the height and width, which results in 4 times fewer pixels. By doubling the channels, we multiply the output by 2. Overall, the number of activations is only divided by 2 each layer, which leaves the neural net a reasonable amount of room to learn patterns.
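As a concrete example (a small sketch; the channel counts here are made up for illustration), a stride-2 Conv2d that doubles the channels turns an 8-channel 28x28 input into a 16-channel 14x14 output, exactly half as many activations:

x = torch.randn(1, 8, 28, 28)                                # 8*28*28 = 6,272 activations
down = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)  # halve the grid, double the channels
down(x).shape                                                # 16*14*14 = 3,136 activations
torch.Size([1, 16, 14, 14])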
What is padding then? It is a way to add zeros around the edges of the input. If we set padding to 2, we put 2 zeros on the left and right of our sentence before applying the kernels. That is why, in this example, the first and last output values are equal to just the bias.
sent2
tensor([[-1.33, 0.52, 0.75, -0.08]])
layer4 = nn.Conv1d(1, 1, kernel_size=2, stride=2, padding=2)
layer4
Conv1d(1, 1, kernel_size=(2,), stride=(2,), padding=(2,))
layer4(sent2)
tensor([[0.42, 0.74, 0.25, 0.42]], grad_fn=<SqueezeBackward1>)
layer4.bias
Parameter containing:
tensor([0.42], requires_grad=True)
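In general (ignoring dilation), the output length follows the formula from the PyTorch docs: out = (in + 2*padding - kernel_size) // stride + 1. Here is a small helper (the name is my own) to check it against layer4:

def conv_out_len(l_in, kernel_size, stride=1, padding=0):
    return (l_in + 2*padding - kernel_size) // stride + 1

conv_out_len(4, kernel_size=2, stride=2, padding=2)  # matches the 4 values above
4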
Conv2d
We’ve been using Conv1d, but Conv2d works the same way, except the kernels are 2-dimensional instead of 1-dimensional, and they slide down as well as across.
layer5 = nn.Conv2d(1, 1, kernel_size=2, stride=1, padding=0)
layer5
Conv2d(1, 1, kernel_size=(2, 2), stride=(1, 1))
layer5.weight.shape, layer5.bias.shape
(torch.Size([1, 1, 2, 2]), torch.Size([1]))
Shape of weight: (output, input, *kernel_size)
Shape of bias: (output,)
Let’s start with a simple example again. Here is a small image, 3 pixels high and 3 pixels wide.
im1 = torch.randn((1, 3, 3))
im1
tensor([[[-0.42, -0.51, -1.57],
[-0.12, 3.59, -1.83],
[ 1.60, -1.28, 0.33]]])
im1.shape
torch.Size([1, 3, 3])
layer5(im1), layer5(im1).shape
(tensor([[[ 1.07, 1.96],
[-1.00, 1.99]]], grad_fn=<SqueezeBackward1>),
torch.Size([1, 2, 2]))
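Just like with Conv1d, we can check one output value by hand: the top-left output is the dot product of the kernel with the top-left 2x2 patch of the image, plus the bias.

(layer5.weight * im1[:, :2, :2]).sum() + layer5.bias  # matches layer5(im1)[0, 0, 0] above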
Now let’s talk about different kernels. Why do we even use kernels? What do they do? Let’s say we have an image of the digit 5 from MNIST.
def show_im(im, figsize=None):
    fig, ax = plt.subplots(figsize=figsize)
    if all(hasattr(im, x) for x in ('cpu','permute','detach')):
        im = im.detach().cpu()
        if len(im.shape)==3 and im.shape[0]<5: im = im.permute(1,2,0)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(im)

im2 = x_train[0].reshape((1, 28, 28))
show_im(im2, figsize=(3,3))
layer6 = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
layer6, layer6.weight.shape
(Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1)), torch.Size([1, 1, 3, 3]))
Here is a kernel that detects top edges. We can apply this kernel to find the top edges in an image. Let’s see what it looks like.
top_edge = torch.nn.Parameter(tensor([[1,1,1],[0,0,0],[-1,-1,-1]],
                                     dtype=torch.float32).reshape((1, 1, 3, 3)))
top_edge
Parameter containing:
tensor([[[[ 1., 1., 1.],
[ 0., 0., 0.],
[-1., -1., -1.]]]], requires_grad=True)
top_edge.shape
torch.Size([1, 1, 3, 3])
layer6.weight = top_edge
layer6.weight
Parameter containing:
tensor([[[[ 1., 1., 1.],
[ 0., 0., 0.],
[-1., -1., -1.]]]], requires_grad=True)
show_im(layer6(im2), figsize=(3,3));
We successfully detected the top edges of the image! Let’s try to get the right edges.
layer7 = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
layer7
Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1))
right_edge = torch.nn.Parameter(tensor([[-1,0,1],
                                        [-1,0,1],
                                        [-1,0,1]], dtype=torch.float32).unsqueeze(0).unsqueeze(0))
right_edge
Parameter containing:
tensor([[[[-1., 0., 1.],
[-1., 0., 1.],
[-1., 0., 1.]]]], requires_grad=True)
layer7.weight = right_edge
show_im(layer7(im2), figsize=(3,3));
Kernels don’t have to detect edges. Here is a kernel of all ones, which simply sums each 3x3 neighborhood, so the output looks like a brighter, slightly blurred version of the input.
layer00 = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
layer00
Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1))
no_op = torch.nn.Parameter(tensor([[1,1,1],[1,1,1],[1,1,1]],
                                  dtype=torch.float32).reshape((1, 1, 3, 3)))
no_op
Parameter containing:
tensor([[[[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]]], requires_grad=True)
show_im(im2, figsize=(3,3))
layer00.weight = no_op
show_im(layer00(im2), figsize=(3,3));
Instead of creating two layers, we usually create one layer with multiple kernels. layer8 has two output channels, which means two kernels. We can set these two kernels to the ones we made above and check the output.
layer8 = nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=0)
layer8.weight
Parameter containing:
tensor([[[[ 0.16, 0.02, 0.11],
[ 0.07, 0.12, 0.17],
[-0.31, 0.17, -0.23]]],
[[[-0.25, 0.02, -0.06],
[ 0.20, -0.19, -0.30],
[ 0.24, -0.05, 0.19]]]], requires_grad=True)
layer8.weight.shape
torch.Size([2, 1, 3, 3])
top_edge.shape
torch.Size([1, 1, 3, 3])
layer8.weight = torch.nn.Parameter(torch.cat([top_edge, right_edge]))
layer8.weight
Parameter containing:
tensor([[[[ 1., 1., 1.],
[ 0., 0., 0.],
[-1., -1., -1.]]],
[[[-1., 0., 1.],
[-1., 0., 1.],
[-1., 0., 1.]]]], requires_grad=True)
def show_ims(ims, figsize=None):
    fig, ax = plt.subplots(1, len(ims), figsize=figsize)
    for i, im in enumerate(ims):
        if all(hasattr(im, x) for x in ('cpu','permute','detach')):
            im = im.detach().cpu()
            if len(im.shape)==3 and im.shape[0]<5: im = im.permute(1,2,0)
        ax[i].set_xticks([])
        ax[i].set_yticks([])
        ax[i].imshow(im)
im3, im4 = layer8(im2)
show_ims([im3, im4])
Nice. We got the same outputs as before. Now let’s go the other way and combine channels: layer9 takes 2 input channels and produces 1 output channel.
layer9 = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=0)
layer9.weight.shape
torch.Size([1, 2, 3, 3])
layer9.weight
Parameter containing:
tensor([[[[ 0.05, 0.06, -0.16],
[-0.11, 0.08, 0.04],
[-0.10, -0.07, 0.22]],
[[-0.04, 0.13, 0.10],
[-0.15, -0.20, 0.23],
[ 0.01, 0.16, 0.05]]]], requires_grad=True)
no_op = torch.nn.Parameter(tensor([[1,1,1],[1,1,1],[1,1,1]],
                                  dtype=torch.float32).reshape((3, 3)))
no_op
Parameter containing:
tensor([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]], requires_grad=True)
torch.stack([no_op, no_op]).unsqueeze(0).shape
torch.Size([1, 2, 3, 3])
no_ops = torch.nn.Parameter(torch.stack([no_op, no_op]).unsqueeze(0))
no_ops
Parameter containing:
tensor([[[[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]],
[[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]]], requires_grad=True)
torch.stack([im3, im4]).shape
torch.Size([2, 26, 26])
layer9.weight = no_ops
show_im(layer9(torch.stack([im3, im4])), figsize=(3,3))
So, we combined the two feature maps and mixed them into one image. In this image, the darkest spots are places where top edges and right edges both occur. This way, a convolutional neural network can combine different features into more useful ones.
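As a sanity check, we can again compute one output pixel by hand. With 2 input channels, the kernel covers both channels, so the sum runs across the channels too:

stacked = torch.stack([im3, im4])                         # shape (2, 26, 26)
(layer9.weight * stacked[:, :3, :3]).sum() + layer9.bias  # matches layer9(stacked)[0, 0, 0]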
Conclusion
In this blog, we learned how convolutional neural networks work by using PyTorch. They are used mostly for images. If you want to learn more, take a look at FastAI lesson 8. It might be easier to watch the video first to understand, and then come back to play around with the code.
For a challenge, you can try training MNIST classifiers with both linear layers and convolutional layers, and find out which one works better.
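To get started on that challenge, here is one possible shape for the convolutional model (a minimal sketch, not a definitive architecture; the channel counts are my own choice):

simple_cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1),    # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(4, 8, kernel_size=3, stride=2, padding=1),    # 14x14 -> 7x7
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),   # 7x7 -> 4x4
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),  # 4x4 -> 2x2
    nn.AdaptiveAvgPool2d(1),                                # average each map down to 1x1
    nn.Flatten(),                                           # one score per digit class
)
simple_cnn(x_train[:16].reshape(-1, 1, 28, 28)).shape
torch.Size([16, 10])
Train it with cross-entropy loss on the (x_train, y_train) tensors we loaded earlier, and compare its accuracy against a plain linear model.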