What is a dataloader?

A DataLoader is exactly what it sounds like: it loads data, typically in batches, into whatever consumes that data.

Dataloader components

A DataLoader might, for example, be used to load data into a model for training. In FastAI code you'll often see a DataLoader assigned to the variable name "dls." In FastAI, a DataLoader accepts a long list of parameters:

DataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False, 
           timeout=0, batch_size=None, shuffle=False, drop_last=False, 
           indexed=None, n=None, device=None, persistent_workers=False, 
           pin_memory_device='', wif=None, before_iter=None, 
           after_item=None, before_batch=None, after_batch=None, 
           after_iter=None, create_batches=None, create_item=None, 
           create_batch=None, retain=None, get_idxs=None, sample=None, 
           shuffle_fn=None, do_batch=None)
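Before walking through the parameters, it helps to see the core contract a DataLoader fulfills: slicing an indexed dataset into batches of bs items. The sketch below is a plain-Python illustration of that behaviour, not fastai code; the function name `batches` is made up for this example.

```python
def batches(dataset, bs, drop_last=False):
    """Yield successive batches of `bs` items from an indexed dataset,
    mimicking the basic batching contract of a DataLoader."""
    for i in range(0, len(dataset), bs):
        b = dataset[i:i + bs]
        if drop_last and len(b) < bs:
            break  # mirror drop_last=True: skip the short final batch
        yield b

# 10 items in batches of 4 -> two full batches plus one short batch of 2
print(list(batches(list(range(10)), bs=4)))
```

With drop_last=True the trailing batch of 2 would be discarded, leaving only the two full batches.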

Arguments to DataLoader:

  • dataset: dataset from which to load the data. Can be either map-style or iterable-style dataset.
  • bs (int): how many samples per batch to load (if batch_size is provided, it will override bs). If bs=None, then it is assumed that dataset.__getitem__ returns a batch.
  • num_workers (int): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
  • pin_memory (bool): If True, the data loader will copy Tensors into CUDA pinned memory before returning them.
  • timeout (float>0): the timeout value in seconds for collecting a batch from workers.
  • batch_size (int): It is only provided for PyTorch compatibility. Use bs.
  • shuffle (bool): If True, the data is reshuffled every time the DataLoader is fully read/iterated (i.e., once per epoch).
  • drop_last (bool): If True, then the last incomplete batch is dropped.
  • indexed (bool): The DataLoader will make a guess as to whether the dataset can be indexed (or is iterable), but you can override it with this parameter. True by default.
  • n (int): Defaults to len(dataset). If you are using iterable-style dataset, you can specify the size with n.
  • device (torch.device): Defaults to default_device(), which is CUDA by default. You can specify a device such as torch.device('cpu').
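To see how shuffle, bs, and drop_last interact, here is a toy stand-in loader in plain Python. The class name `MiniLoader` is invented for this sketch, and it only mirrors the parameter semantics described above; it is not how fastai implements its DataLoader.

```python
import random

class MiniLoader:
    """Toy illustration of how bs, shuffle, and drop_last interact.
    Parameter names mirror the fastai DataLoader; the logic is a sketch."""
    def __init__(self, dataset, bs, shuffle=False, drop_last=False, seed=None):
        self.dataset = list(dataset)
        self.bs = bs
        self.shuffle = shuffle
        self.drop_last = drop_last
        self.rng = random.Random(seed)

    def __iter__(self):
        idxs = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(idxs)  # fresh permutation on every full pass
        for i in range(0, len(idxs), self.bs):
            batch = [self.dataset[j] for j in idxs[i:i + self.bs]]
            if self.drop_last and len(batch) < self.bs:
                return  # discard the incomplete final batch
            yield batch

# Without drop_last, all 10 items come through; with it, only full batches do.
print([len(b) for b in MiniLoader(range(10), bs=4)])
print([len(b) for b in MiniLoader(range(10), bs=4, drop_last=True)])
```

Because shuffling happens at the start of each `__iter__` call, every epoch sees the items in a new order, which matches the shuffle behaviour described above.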