What is a dataloader?
A DataLoader is exactly what it sounds like: it loads data into whatever needs that data.
Dataloader components
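A DataLoader might, for example, be used to load data into a model for training. In FastAI code you'll often see the variable name “dls”, which conventionally holds a DataLoaders object (a wrapper around the training and validation DataLoader instances). In FastAI, a DataLoader has a whole bunch of available parameters: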
DataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False,
timeout=0, batch_size=None, shuffle=False, drop_last=False,
indexed=None, n=None, device=None, persistent_workers=False,
pin_memory_device='', wif=None, before_iter=None,
after_item=None, before_batch=None, after_batch=None,
after_iter=None, create_batches=None, create_item=None,
create_batch=None, retain=None, get_idxs=None, sample=None,
shuffle_fn=None, do_batch=None)
Arguments to DataLoader:
- dataset: dataset from which to load the data. Can be either a map-style or an iterable-style dataset.
- bs (int): how many samples per batch to load (if batch_size is provided then batch_size will override bs). If bs=None, then it is assumed that dataset.__getitem__ returns a batch.
- num_workers (int): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
- pin_memory (bool): If True, the data loader will copy Tensors into CUDA pinned memory before returning them.
- timeout (float>0): the timeout value in seconds for collecting a batch from workers.
- batch_size (int): It is only provided for PyTorch compatibility. Use bs.
- shuffle (bool): If True, then the data is reshuffled each time the DataLoader is fully iterated.
- drop_last (bool): If True, then the last incomplete batch is dropped.
- indexed (bool): The DataLoader will make a guess as to whether the dataset can be indexed (or is iterable), but you can override it with this parameter. With the default of None the guess is used, which comes out True for an ordinary map-style dataset.
- n (int): Defaults to len(dataset). If you are using an iterable-style dataset, you can specify the size with n.
- device (torch.device): Defaults to default_device(), which picks CUDA when a GPU is available. You can specify the device explicitly, e.g. torch.device('cpu').
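To make the batching arguments above more concrete, here is a minimal pure-Python sketch of what bs, shuffle and drop_last mean for a map-style dataset. It is not FastAI's implementation: iter_batches and its seed argument are hypothetical names made up for this illustration, and the real DataLoader also collates items into tensors, can use worker processes, and moves batches to a device.

```python
import random
from typing import Iterator, List, Optional, Sequence


def iter_batches(
    dataset: Sequence,
    bs: int,
    shuffle: bool = False,
    drop_last: bool = False,
    seed: Optional[int] = None,
) -> Iterator[List]:
    """Yield lists of up to `bs` items from a map-style (indexable) dataset.

    Hypothetical helper for illustration only; it is not part of FastAI.
    """
    idxs = list(range(len(dataset)))
    if shuffle:
        # Reorder the indices once per pass; a fixed seed makes the order repeatable.
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        chunk = idxs[start:start + bs]
        if drop_last and len(chunk) < bs:
            break  # skip the final, incomplete batch
        yield [dataset[i] for i in chunk]


items = list(range(10))
full = list(iter_batches(items, bs=4))
trimmed = list(iter_batches(items, bs=4, drop_last=True))
print(full)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(trimmed)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With 10 items and bs=4, the last batch holds only 2 items, so drop_last=True leaves two full batches. Shuffling changes which items land in which batch, but not the batch sizes.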