AI accelerator hardware like Google’s Tensor Processing Units and Intel’s Nervana Neural Network Processor promises to speed up AI model training, but because of the way the chips are architected, earlier stages of the training pipeline (like data preprocessing) don’t benefit from the boosts. That’s why scientists at Google Brain, Google’s AI research division, propose in a paper a technique called “data echoing,” which they say reduces the computation used by earlier pipeline stages by reusing intermediate outputs from these stages. According to the researchers, the best-performing data echoing algorithms can match the baseline’s predictive performance using less upstream processing, in some cases compensating for an input pipeline that is four times slower.
“Training a neural network requires more than just the operations that run well on accelerators, so we cannot rely on accelerator improvements alone to keep producing speedups in all cases,” observed the coauthors.
“A training program may need to read and decompress training data, shuffle it, batch it, and even transform or augment it. These steps may exercise multiple system components, including CPUs, disks, network bandwidth, and memory bandwidth.”
In a typical training pipeline, the AI system first reads and decodes the input data, shuffles it, applies a set of transformations to augment it, gathers examples into batches, and then iteratively updates parameters to reduce error. The researchers’ data echoing approach inserts a stage into the pipeline, somewhere before the parameter update step, that repeats the output data of the previous stage, theoretically reclaiming idle compute capacity.
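At a high level, the echo stage simply re-emits each upstream item several times, so the slower upstream stages run less often per parameter update. The sketch below is a minimal illustration under that assumption; the function `data_echoing`, the stage names in the commented usage, and the echo factor are hypothetical stand-ins, not the paper’s implementation.

```python
# A minimal sketch of data echoing, assuming a plain Python generator pipeline.
# The echo stage re-emits each upstream item `echo_factor` times, so the slower
# upstream stages run less often per downstream parameter update.

def data_echoing(upstream, echo_factor=2):
    """Repeat each item produced by the upstream pipeline stage echo_factor times."""
    for item in upstream:
        for _ in range(echo_factor):
            yield item

# Hypothetical usage (read_files, decode, augment, batch_examples, and
# update_parameters are illustrative stand-ins, not the paper's code):
#
#   examples = augment(decode(read_files(paths)))              # slow upstream stages
#   for batch in batch_examples(data_echoing(examples), 32):   # echo before batching
#       update_parameters(batch)                               # fast accelerator step
```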
In experiments, the team evaluated data echoing on two language modeling tasks, two image classification tasks, and one object detection task, using AI models trained on open source data sets. They measured training time as the number of “fresh” training examples required to reach a target metric, and they investigated whether data echoing could reduce that number.
The coauthors report that in all but one case, data echoing required fewer fresh examples than the baseline and reduced training time. Furthermore, they note that the earlier echoing is inserted in the pipeline (for example, before data augmentation rather than after batching), the fewer fresh examples were needed, and that echoing occasionally performed better with larger batch sizes.
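To make the insertion-point distinction concrete, here is a toy comparison that reuses the `data_echoing` sketch above; `augment`, `to_batches`, and the batch and echo sizes are made-up stand-ins. Echoing before augmentation re-augments each repeated example, so downstream batches stay varied, whereas echoing after batching simply replays the identical batch.

```python
import random

def augment(x):
    # Toy augmentation: repeats of the same example come out slightly different.
    return x + random.random() * 0.01

def to_batches(items, batch_size):
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

fresh = list(range(8))  # stand-in for the output of the expensive upstream stages

# Echo before augmentation: each repeated example is re-augmented.
early_batches = to_batches((augment(x) for x in data_echoing(fresh, 2)), 4)

# Echo after batching: the exact same batch is replayed twice in a row.
late_batches = list(data_echoing(to_batches((augment(x) for x in fresh), 4), 2))
```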