Hugging Face Features

Author

galopy

Published

July 8, 2023

Hugging Face Datasets

In lesson 15 of Practical Deep Learning For Coders, we used Hugging Face Datasets to download the Fashion MNIST data and train our model. I ran into a problem here: I could not fit my model fast enough. Even using all my CPU cores, it was not as fast as Jeremy's computer, and the CPU on Google Colab was not strong either. It was not a huge problem, but it was annoying, so I decided to find a solution.

In the lesson, we downloaded images and applied a transform function to convert them into tensors. With dsd.with_transform, the transform runs on every batch, and that took most of the time. We don't actually have to apply the transform on every batch, so let's find a way to do it only once.

Initially, I just wanted to convert the images into tensors with map, but Hugging Face uses Apache Arrow, which does not have a tensor type. So I used Hugging Face Features and Array2D to fix this problem.

Here is the original approach from the course, which takes a long time.

Original approach

from miniai.datasets import *
from miniai.conv import *
from datasets import load_dataset,load_dataset_builder

import torch
import torchvision.transforms.functional as TF
from torch import optim, nn,tensor
import torch.nn.functional as F

import fastcore.all as fc
import logging
logging.disable(logging.WARNING)

from tqdm import tqdm

First, we grab the data from Hugging Face.

x,y = 'image','label'
name = "fashion_mnist"
dsd = load_dataset(name)

Here is an inplace transform function. It is applied on every batch and converts the images into flattened tensors with the right shape.

@inplace
def transformi(b): b[x] = [torch.flatten(TF.to_tensor(o)) for o in b[x]]
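
For reference, the inplace decorator from miniai wraps a function that mutates the batch dict so that it also returns the batch. Roughly, it looks like this (a minimal sketch, not necessarily miniai's exact code):

def inplace(f):
    def _f(b):
        f(b)      # mutate the batch dict in place
        return b  # return it so with_transform/map can use the result
    return _f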

Since with_transform applies the transform function to every new batch, it is a good fit for data augmentation or anywhere else we want randomness, as in the sketch below.
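
For example, a randomized augmentation could be plugged in the same way. This is a hypothetical sketch; TF.hflip is torchvision's horizontal flip, and the 0.5 probability is arbitrary:

import random

@inplace
def aug(b): b[x] = [TF.hflip(TF.to_tensor(o)) if random.random() < 0.5
                    else TF.to_tensor(o) for o in b[x]]

aug_tds = dsd.with_transform(aug)  # flips are re-sampled on every access

Back to the actual pipeline: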

bs = 1024
tds = dsd.with_transform(transformi)

Now we make PyTorch DataLoaders. We can specify how many worker processes to use; we are using 4 here.

dls = DataLoaders.from_dd(tds, bs, num_workers=4)
dt = dls.train
xb,yb = next(iter(dt))
xb.shape,yb[:10]
(torch.Size([1024, 784]), tensor([2, 6, 7, 4, 9, 5, 3, 5, 6, 7]))
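
Under the hood, from_dd builds standard PyTorch DataLoaders over the dict-style rows. Roughly, it does something like this (a sketch, not miniai's exact code):

from torch.utils.data import DataLoader

def collate(rows):
    # Each row is a dict like {'image': tensor, 'label': int};
    # stack the columns into an (xb, yb) pair of tensors.
    return (torch.stack([r[x] for r in rows]),
            tensor([r[y] for r in rows]))

train_dl = DataLoader(tds['train'], batch_size=bs, shuffle=True,
                      collate_fn=collate, num_workers=4)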

This is the Learner class. It is not very flexible, but it works.

class Learner:
    def __init__(self, model, dls, loss_func, lr, opt_func=optim.SGD): fc.store_attr()

    def one_batch(self):
        self.xb,self.yb = to_device(self.batch)
        self.preds = self.model(self.xb)
        self.loss = self.loss_func(self.preds, self.yb)
        if self.model.training:
            self.loss.backward()
            self.opt.step()
            self.opt.zero_grad()
        with torch.no_grad(): self.calc_stats()

    def calc_stats(self):
        # Accumulate per-batch accuracy and loss, weighted by batch size.
        acc = (self.preds.argmax(dim=1)==self.yb).float().sum()
        self.accs.append(acc)
        n = len(self.xb)
        self.losses.append(self.loss*n)
        self.ns.append(n)

    def one_epoch(self, train):
        self.model.training = train
        dl = self.dls.train if train else self.dls.valid
        for self.num,self.batch in enumerate(dl): self.one_batch()
        n = sum(self.ns)
        print(self.epoch, self.model.training, sum(self.losses).item()/n, sum(self.accs).item()/n)

    def fit(self, n_epochs):
        self.accs,self.losses,self.ns = [],[],[]
        self.model.to(def_device)
        self.opt = self.opt_func(self.model.parameters(), self.lr)
        self.n_epochs = n_epochs
        for self.epoch in range(n_epochs):
            self.one_epoch(True)                         # training pass
            with torch.no_grad(): self.one_epoch(False)  # validation pass

m,nh = 28*28,50
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))

We fit, but this is not very fast.

learn = Learner(model, dls, F.cross_entropy, lr=0.2)
learn.fit(1)  # using only 1 worker
0 True 1.1959598958333333 0.6107833333333333
0 False 1.1534678571428572 0.6217571428571429
CPU times: user 5.41 s, sys: 461 ms, total: 5.87 s
Wall time: 7.88 s
learn = Learner(model, dls, F.cross_entropy, lr=0.2)
learn.fit(1)  # using 4 workers
0 True 0.7164356770833333 0.7443166666666666
0 False 0.7154278459821428 0.7437571428571429
CPU times: user 4.6 s, sys: 434 ms, total: 5.03 s
Wall time: 7.79 s

Okay. We used 4 worker processes to train the model here, but it is still not very fast. Let's make it faster!

Faster fit

By using Hugging Face Features, we can turn the images into tensors when we download the data. First, we use load_dataset_builder to look at the metadata, such as the features, splits, and description of the data, without actually downloading anything yet.

builder = load_dataset_builder(name)
builder.info.features
{'image': Image(decode=True, id=None),
 'label': ClassLabel(names=['T - shirt / top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'], id=None)}
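
Other fields on builder.info work the same way and still require no download; these attribute names come from the datasets library's DatasetInfo:

builder.info.splits       # split names and sizes, e.g. train/test
builder.info.description  # free-text description of the dataset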
dsd_features = dsd['train'].features.copy()
dsd_features
{'image': Image(decode=True, id=None),
 'label': ClassLabel(names=['T - shirt / top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'], id=None)}
from datasets import Features, Array2D

We use Array2D to turn the images into 2D arrays with a certain shape and dtype. It is a bit weird using Array2D with shape=[1, 28*28] instead of something like an Array or Array1D with shape=[28*28], but Hugging Face does not have those, so we will just use map to squeeze out the extra dimension later. This won't be a problem with colored images, since they have a channel dimension anyway; see the sketch after the next code block.

dsd_features['image'] = Array2D(shape=[1, 28*28], dtype='float32')
dsd_features
{'image': Array2D(shape=(1, 784), dtype='float32', id=None),
 'label': ClassLabel(names=['T - shirt / top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'], id=None)}
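
For colored images the shape is naturally 3D, so Array3D fits directly and no squeeze is needed afterwards. A hypothetical sketch for CIFAR-10-sized images (the shape here is an example, not from fashion_mnist):

from datasets import Array3D

color_features = dsd_features.copy()
color_features['image'] = Array3D(shape=[3, 32, 32], dtype='float32')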

Now we load the dataset using those features. But this gives us a list! Why is it not a tensor? We have to set the format to torch to get tensors back.

dsd = load_dataset(name, features=dsd_features)
type(dsd['train'][0][x])
list
dsd.set_format(type="torch")
type(dsd['train'][0][x])
torch.Tensor
dsd['train'][0][x].shape
torch.Size([1, 784])
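
Note that set_format applies to every column, so the labels (y, as defined above) come back as tensors as well:

type(dsd['train'][0][y])  # torch.Tensor too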

Now we just need to squeeze each tensor to get rid of the useless 1 in the shape.

@inplace
def sq(b): b[x] = [o.squeeze().div(255) for o in b[x]]  # drop the leading 1; div(255) rescales raw 0-255 pixels to [0,1]

Here, we use map to squeeze them. With batched=True, map gets whole batches at once instead of single rows, which is faster.

tds = dsd.map(sq, batched=True)
tds['train'][0][x].shape
torch.Size([784])
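
map can be tuned further; batch_size and num_proc are real arguments of datasets.Dataset.map, though the values here are just examples:

tds = dsd.map(sq, batched=True, batch_size=1024, num_proc=4)  # fan out across 4 processes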

Why not just use torch.tensor?

So, why didn't we just use torch.tensor from the beginning instead of Features and Array2D? Because Hugging Face converts tensors back to images. Hugging Face uses Apache Arrow, and Arrow does not support tensors, so the data has to be either a list or an image, and we do not want an image.
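
Here is a sketch of the pitfall, using a hypothetical new column name 'pixels' so the Image feature does not get in the way: without declaring a Features type, the tensors returned from map are stored by Arrow as plain nested lists.

naive = load_dataset(name).map(
    lambda b: {'pixels': [torch.flatten(TF.to_tensor(o)) for o in b[x]]},
    batched=True)
type(naive['train'][0]['pixels'])  # list, not torch.Tensor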

Now everything is in the right shape. The difference is that we no longer have to keep converting from image to tensor on every batch: with map, the work is done once up front, and there is no computation on the fly, which is what we want here.

dls = DataLoaders.from_dd(tds, bs, num_workers=0)
dt = dls.train
xb,yb = next(iter(dt))
xb.shape,yb[:10]
(torch.Size([1024, 784]), tensor([2, 0, 0, 0, 0, 7, 0, 5, 5, 2]))

Now it is very fast to train, even with num_workers=0 (no worker processes at all).

learn = Learner(model, dls, F.cross_entropy, lr=0.2)
learn.fit(1)
0 True 0.6185346354166666 0.7802833333333333
0 False 0.6170732700892857 0.7807571428571428
CPU times: user 5.4 s, sys: 225 ms, total: 5.63 s
Wall time: 2.82 s

Conclusion

We used Features and Array2D to convert the images into tensors once, for faster training. It was awkward using Array2D when we really wanted an Array1D, but it was not a problem.