Build an efficient data pipeline
Before feeding data to the model for training we need to:
- Load the data from disk into memory.
- Apply preprocessing such as normalization and data augmentation.
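To build an efficient data pipeline, PyTorch provides two abstractions:
- Dataset: a source of data, such as files, together with the transformations applied to each sample.
- DataLoader: an interface for retrieving samples from a Dataset, typically in batches.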
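As a rough illustration of how the two pieces fit together, here is a minimal sketch. The names SquaresDataset and scale_to_unit are hypothetical and the data is generated in memory; a real Dataset would usually read files from disk in __getitem__.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): sample i is the pair (i, i squared)."""

    def __init__(self, size, transform=None):
        self.size = size
        self.transform = transform

    def __len__(self):
        # DataLoader uses this to know how many samples exist.
        return self.size

    def __getitem__(self, idx):
        # A real dataset would load one sample from disk here.
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx) ** 2])
        if self.transform is not None:
            x = self.transform(x)
        return x, y


def scale_to_unit(x, max_value=99.0):
    # Simple normalization: map inputs from [0, 99] to [0, 1].
    return x / max_value


dataset = SquaresDataset(100, transform=scale_to_unit)
loader = DataLoader(dataset, batch_size=32, shuffle=False)

for features, targets in loader:
    # Each iteration yields one batch stacked along dimension 0.
    print(features.shape, targets.shape)
```

The Dataset only knows how to produce a single sample; batching, ordering, and (optionally) parallel loading are left to the DataLoader.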
Our goal is to minimize GPU idle time. To do this, we need to:
- Minimize data transfer time between CPU and GPU.
- Increase the number of data-loading workers.
Data transfer from CPU to GPU
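Host (CPU) memory is pageable by default, and the GPU cannot read from pageable memory directly. For every transfer, the CUDA driver first copies the data into a temporary pinned (page-locked) buffer and then copies it from there to the GPU. To skip this extra copy, we can ask the DataLoader to place each batch in pinned memory directly: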
DataLoader(training_data, batch_size=128, pin_memory=True)
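Pinned memory should be used with care: a request for pinned memory can fail, and because pinned pages cannot be swapped out, overusing it leaves less physical memory for the rest of the system.
Refer to the NVIDIA blog post to understand more.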
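The sketch below shows one way pinned memory might be used in a training loop. It is an illustration under some assumptions: training_data is a synthetic TensorDataset, pinning is enabled only when a CUDA device is present, and the non_blocking=True argument is an addition not discussed above (it lets the copy from pinned memory run asynchronously with respect to the host).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fall back to the CPU so the sketch also runs on machines without a GPU.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in for real training data (assumption).
training_data = TensorDataset(
    torch.randn(512, 16),          # 512 samples with 16 features each
    torch.randint(0, 2, (512,)),   # binary labels
)

# Pinning only helps when batches are copied to a CUDA device.
loader = DataLoader(training_data, batch_size=128, pin_memory=use_cuda)

for features, labels in loader:
    # With pinned source memory, non_blocking=True allows the
    # host-to-device copy to overlap with other host-side work.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward and backward pass would go here ...
```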
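Increasing the number of workers
By default, DataLoader uses num_workers=0, which means the data is loaded in the main process and the training loop waits while each batch is prepared.
If we set num_workers to a positive value, PyTorch creates that many worker processes, which load and preprocess samples asynchronously and in parallel: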
DataLoader(training_data, batch_size=128, num_workers=8)
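To get a feel for the effect of num_workers, the following sketch times one pass over a deliberately slow dataset. SlowDataset and time_epoch are hypothetical helpers, and the time.sleep call merely stands in for disk reads or heavy preprocessing; actual speedups depend on the machine, the dataset, and the cost of starting worker processes.

```python
import time

import torch
from torch.utils.data import Dataset, DataLoader


class SlowDataset(Dataset):
    """Hypothetical dataset whose samples take a while to load."""

    def __init__(self, size=64, delay=0.01):
        self.size = size
        self.delay = delay

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        time.sleep(self.delay)  # stand-in for disk I/O or augmentation
        return torch.tensor([float(idx)])


def time_epoch(num_workers, batch_size=8):
    """Iterate over the dataset once; return (batch count, seconds)."""
    loader = DataLoader(
        SlowDataset(), batch_size=batch_size, num_workers=num_workers
    )
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    return n_batches, time.perf_counter() - start


# The guard matters on platforms that start workers by re-importing
# the main module (for example Windows and macOS).
if __name__ == "__main__":
    for workers in (0, 2):
        n_batches, elapsed = time_epoch(workers)
        print(f"num_workers={workers}: {n_batches} batches in {elapsed:.2f}s")
```

More workers is not always better: each worker is a separate process with its own memory and startup cost, so it is worth measuring a few values rather than picking the largest one.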