lasasfabric.blogg.se

Pes 2017 online cpu tackling

For their power-efficient performance and highly parallelised, flexible architecture, FPGAs presented themselves as a viable option for hard real-time computation of heavy deep-learning applications. FPGAs also allow designers to develop modular IP cores, enabling easier prototyping, with the option to selectively redeploy design areas at runtime without risk to the overall application.


CPUs and GPUs have been prominent for executing CNNs in offline training settings; however, their poor energy efficiency and low throughput have made them less attractive for embedded use. Recent literature shows a clear demand for embedded deep-learning solutions built on hardware-constrained designs and novel compression techniques. Our implementation uses a streaming interface that lends itself well to data streams and live feeds rather than static data reads from memory. That is, we do not use the standard array-of-processing-elements (PEs) approach, which is efficient for offline inference; instead, we translate the architecture into a pipeline through which data is streamed, allowing new samples to be read as they become available. We validate the results on the Zynq-7100 across three datasets and architectures of varying size, against CPU and GPU implementations. GPUs consistently outperform FPGAs in training time in batch-processing scenarios, but in data-stream scenarios the FPGA designs achieve a significant speedup over GPU and CPU when enough resources are dedicated to the learning task: a 2.8×, 5.8×, and 3× speedup over GPU was achieved on three architectures trained on MNIST, SVHN, and CIFAR-10, respectively.
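The streaming idea can be sketched in software. Below is a minimal, hypothetical Python model (names and the 1-D simplification are illustrative, not the actual pipeline) of a line-buffered convolution stage that emits an output as soon as a full window of live samples has arrived, rather than waiting for a batch to accumulate in memory:

```python
from collections import deque

def stream_conv1d(samples, weights):
    """Streaming 1-D convolution stage: a sliding window (a line buffer
    in hardware) emits an output as soon as enough samples have arrived,
    so new inputs are consumed as they become available."""
    window = deque(maxlen=len(weights))
    for s in samples:                      # samples arrive one at a time
        window.append(s)
        if len(window) == len(weights):    # window full -> emit immediately
            yield sum(a * b for a, b in zip(window, weights))

# A "live feed" of samples; each output appears without batching.
feed = iter([1, 2, 3, 4, 5])
outputs = list(stream_conv1d(feed, [1, 0, -1]))
```

In hardware the generator corresponds to a pipeline stage fed by a stream, so throughput is bounded by sample arrival rate rather than batch assembly.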


We automatically generate multiple hardware designs from high-level CNN descriptions using a multi-objective optimization algorithm that explores the design space by exploiting CNN parallelism. These designs, which trade off resources for throughput, allow users to tailor implementations to their hardware and applications. The training pipeline is generated from the backpropagation (BP) equations of convolution, which highlight an overlap in computation with the forward pass (FP). We translate this overlap into hardware by reusing most of the FP pipeline, reducing the resource overhead.
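The overlap that the BP equations expose can be illustrated with a toy 1-D example (an illustrative numpy sketch, not the paper's hardware): both the weight gradient and the input gradient of a convolution layer are computed by the same sliding-window primitive used in the forward pass, which is why the FP pipeline can be reused.

```python
import numpy as np

def xcorr(x, w):
    """Valid sliding-window dot product: the forward-pass primitive."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

# Forward pass: y = xcorr(x, w)  (illustrative sizes and values)
x = np.array([1.0, 2.0, -1.0, 3.0, 0.5])
w = np.array([0.2, -0.4, 0.1])
y = xcorr(x, w)

# Upstream gradient dL/dy (arbitrary values for the sketch)
gy = np.array([1.0, -2.0, 0.5])

# BP equation 1: dL/dw is the SAME primitive applied to (x, gy)
gw = xcorr(x, gy)

# BP equation 2: dL/dx is the same primitive on zero-padded gy
# with the flipped kernel (a "full" correlation)
pad = len(w) - 1
gx = xcorr(np.pad(gy, pad), w[::-1])
```

Because all three computations reduce to the same window-and-dot-product structure, a pipeline built for the forward pass can, with modest multiplexing, also produce `gw` and `gx`; this is the kind of resource reuse described above.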


Training convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become essential for the future deployment of CNNs on autonomous systems. In this work, we present an automated CNN training pipeline compilation tool for Xilinx FPGAs.









