Build an efficient data pipeline

22 Sep, 2024

Before feeding data to the model for training we need to:

Load the data from disk into memory.
Preprocess it, for example with normalization and data augmentation.

To build an efficient data pipeline, PyTorch provides:

Dataset - stores the samples (for example, files on disk) and applies any transformations to them.
DataLoader - wraps a Dataset in an iterable that returns the samples in batches.
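
As a rough illustration of how the two pieces fit together, here is a minimal sketch. ToyDataset, its random "images" and the normalize function are hypothetical stand-ins invented for this example; a real dataset would read files from disk inside __getitem__.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: random 'images' with integer labels."""

    def __init__(self, num_samples=256, transform=None):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(num_samples, 3, 8, 8, generator=generator)
        self.labels = torch.randint(0, 10, (num_samples,), generator=generator)
        self.transform = transform

    def __len__(self):
        # Number of samples the DataLoader can index into.
        return len(self.images)

    def __getitem__(self, index):
        # Fetch one sample and apply the preprocessing step, if any.
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]


def normalize(image):
    # Map values from [0, 1) to [-1, 1); a stand-in for real preprocessing.
    return (image - 0.5) / 0.5


training_data = ToyDataset(transform=normalize)
loader = DataLoader(training_data, batch_size=128, shuffle=True)

for images, labels in loader:
    # Each iteration yields one batch of stacked samples.
    print(images.shape, labels.shape)
```

The Dataset only knows how to return a single sample; batching and shuffling are left to the DataLoader.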

Our goal is to minimize GPU idle time. To do this we need to:

Minimize data transfer time between CPU and GPU.
Increase the number of DataLoader workers.

Data transfer from CPU to GPU

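Data in pageable host memory is never copied directly to the GPU. The CUDA driver first copies it into a temporary pinned (page-locked) buffer and only then transfers it to the device. To avoid this extra copy, we can ask the DataLoader to place each batch directly in pinned memory.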

DataLoader(training_data, batch_size=128, pin_memory=True)

The request for pinned memory can fail, and pinning increases memory pressure: page-locked memory cannot be swapped out, so less memory is left for the rest of the system.
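
A minimal sketch of how pinned batches are typically moved to the GPU follows. The TensorDataset with random values is a hypothetical stand-in for training_data, and the code falls back to the CPU when no GPU is available. Passing non_blocking=True to .to() is an addition not mentioned above; it allows the copy from pinned memory to run asynchronously with respect to the host.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical stand-in for training_data: 512 samples with 16 features each.
training_data = TensorDataset(torch.rand(512, 16), torch.randint(0, 2, (512,)))

# Only request pinned memory when a GPU is present; it brings no benefit otherwise.
loader = DataLoader(training_data, batch_size=128, pin_memory=use_cuda)

for features, targets in loader:
    # With pinned source memory, non_blocking=True lets the host continue
    # while the copy to the device is in flight.
    features = features.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```

Whether this helps in practice depends on batch size and hardware, so it is worth measuring before and after.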

Refer to NVIDIA's blog post to understand more.

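By default, DataLoader uses num_workers=0, which means the data is loaded in the main process and the training loop waits while each batch is prepared.

If we increase the number of workers, PyTorch creates additional processes that load samples in parallel and prepare upcoming batches while the model is busy with the current one.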

DataLoader(training_data, batch_size=128, num_workers=8)
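
One way to check whether extra workers help is to time an epoch with different settings. The sketch below does this with a hypothetical SlowDataset that sleeps in __getitem__ to imitate slow disk reads; the class, the delay and the helper time_one_epoch are assumptions made for illustration, and the numbers it prints will vary by machine.

```python
import time

import torch
from torch.utils.data import Dataset, DataLoader


class SlowDataset(Dataset):
    """Hypothetical dataset whose samples are slow to load."""

    def __init__(self, num_samples=64, delay=0.005):
        self.num_samples = num_samples
        self.delay = delay

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        time.sleep(self.delay)  # simulate reading and decoding a file
        return torch.full((4,), float(index))


def time_one_epoch(num_workers):
    # Iterate over the whole dataset once and measure the wall-clock time.
    loader = DataLoader(SlowDataset(), batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    num_batches = sum(1 for _ in loader)
    return num_batches, time.perf_counter() - start


if __name__ == "__main__":
    # The guard matters on platforms that start worker processes with "spawn".
    for workers in (0, 2):
        batches, seconds = time_one_epoch(workers)
        print(f"num_workers={workers}: {batches} batches in {seconds:.2f}s")
```

Starting worker processes has a cost of its own, so for very small or very fast datasets more workers can be slower. A reasonable approach is to try a few values and keep the one that gives the shortest epoch.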