TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

Mahajan et al., HPCA 2016. Presented by Akhila Rallapalli and Aditya Shah.

Outline
- Introduction
- Overview
- Background on Stochastic Gradient Descent
- Programming Interface
- Model Compiler for TABLA
- Design Builder and Template Designs
- Evaluation
- Conclusion

Contributions
- Uses FPGAs to accelerate machine learning algorithms based on stochastic optimization.
- Generates accelerators while abstracting hardware design details away from the programmer.
- Covers five learning algorithms - logistic regression, SVM, recommender systems, back propagation, and linear regression - each on two topologies.
- Evaluated on a Xilinx Zynq FPGA platform.

Stochastic Gradient Descent
- Objective: minimize the prediction error over the entire training data.
- Repeats the gradient update w(t+1) = w(t) - mu * df(w(t), xi)/dw for every training point xi until the objective function converges to its minimum.

TABLA Overview
- Four components: a high-level programming model, a model compiler, a design builder, and pre-designed templates.

Programming Interface
- A high-level language for representing learning algorithms.
- Incorporates data declarations and mathematical operations.
- Data declarations represent the model parameters and the training data.
- Mathematical operations express the numerical computation of the gradient.

Implementing the Gradient for Logistic Regression on TABLA
- G: gradient matrix with n rows and m columns.
- X: input vector with m elements.
- Y: expected output vector with n elements.
- W: n x m model parameters.
- Lambda: regularization factor.
- (A Python sketch of this computation appears after the scheduling discussion below.)

Model Compiler for TABLA
- Statically generates an execution schedule for the accelerator.
- Integrates the gradient of the objective function with the stochastic gradient descent solver.
- Generates a dataflow graph (DFG) of the entire learning algorithm.
- Translates the dataflow graph into a static schedule for hardware execution.

Integrating Stochastic Gradient Descent
- Uses stochastic gradient descent to learn the model parameters from the training data.
- Integrates a general template for stochastic gradient descent with the programmer-provided gradient code.
- The variables used for the integration are the gradient and the model.

Generating the Dataflow Graph
- Converts the code written in the programming interface into a dataflow graph.
- Nodes represent basic computations; edges capture the dependencies between them.
- Constructs the DFG for the whole learning algorithm by linking the DFGs of its individual constructs.

Static Scheduling
- Generates a step-by-step schedule of each operation from the DFG.
- Uses Minimum-Latency Resource-Constrained Scheduling (ML-RCS) to generate this schedule; a list-scheduling sketch follows below.
- The design builder generates a skeleton of the accelerator to determine the number of available resources.
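The transcript does not include the actual TABLA code for the logistic-regression gradient. As a rough illustration of the computation those slides describe (a stochastic-gradient update applied to a regularized logistic-regression gradient), here is a minimal Python/NumPy sketch; the function names, the sigmoid-based gradient form, and the learning rate mu are assumptions made for illustration, not TABLA's own language.

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(W, x, y, lam):
    """Gradient of a regularized logistic-regression objective for one
    training example (illustrative form, not TABLA syntax).
    W:   n x m model parameters
    x:   input vector with m elements
    y:   expected output vector with n elements
    lam: regularization factor (Lambda)"""
    err = sigmoid(W @ x) - y        # prediction error, shape (n,)
    G = np.outer(err, x) + lam * W  # n x m gradient matrix G
    return G

def sgd_step(W, x, y, lam, mu):
    # One stochastic gradient descent update: w(t+1) = w(t) - mu * gradient.
    return W - mu * logistic_gradient(W, x, y, lam)

# Tiny usage example with made-up shapes (n = 2 outputs, m = 3 features).
W = np.zeros((2, 3))
x = np.array([1.0, 0.5, -0.2])
y = np.array([1.0, 0.0])
W = sgd_step(W, x, y, lam=0.01, mu=0.1)
```

In TABLA the programmer only specifies the gradient; the surrounding solver loop that applies this update over every training point until convergence comes from the framework's SGD template.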
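The slides name ML-RCS but do not show how the schedule is produced. One common way to realize minimum-latency scheduling under a resource constraint is a list scheduler that issues every ready DFG node as long as processing engines are free; the sketch below follows that idea on a toy DFG. The dict-based DFG encoding, the function schedule_dfg, and the unit-latency assumption are illustrative choices, not the paper's implementation.

```python
def schedule_dfg(dfg, num_pes):
    """Resource-constrained list scheduling in the spirit of ML-RCS.

    dfg:     dict mapping each node to the list of nodes it depends on
             (assumed to be a DAG with unit-latency operations).
    num_pes: number of processing engines available in each cycle.
    Returns a dict mapping cycle -> list of nodes issued in that cycle."""
    remaining = set(dfg)
    done = set()
    schedule = {}
    cycle = 0
    while remaining:
        # A node is ready once all of its predecessors have completed.
        ready = [n for n in dfg if n in remaining
                 and all(p in done for p in dfg[n])]
        issued = ready[:num_pes]  # respect the per-cycle resource constraint
        schedule[cycle] = issued
        done.update(issued)
        remaining.difference_update(issued)
        cycle += 1
    return schedule

# Toy DFG for g = x1*w1 + x2*w2 scheduled onto two processing engines:
example_dfg = {
    "mul1": [],                # x1 * w1
    "mul2": [],                # x2 * w2
    "add":  ["mul1", "mul2"],  # mul1 + mul2
}
print(schedule_dfg(example_dfg, num_pes=2))
# -> {0: ['mul1', 'mul2'], 1: ['add']}
```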
Design Builder
- Builds a clustered, hierarchical architecture: the DFG maps onto processing units (PUs), and each PU is composed of processing engines (PEs).
- The number of PEs and PUs is decided by the parallelism available in the DFG.
- Combining static scheduling with the generated design keeps the accelerator scalable, general, and customizable.

Template Design - PU
- Localizes the majority of the data traffic within the PUs.
- A single PU can carry out the computation of the entire learning algorithm; the design builder scales the number of PUs based on the DFG parallelism.
- PUs are connected by a pipelined global bus.
- Static scheduling of the PUs simplifies both the PU and the bus design.

Template Design - PE
- The basic block of the template design.
- Fixed components: ALU, data/model buffer, registers, and busing logic. The ALU is fixed, but the operations it performs depend on the DFG.
- Highly customizable components: control unit, nonlinear unit, and neighbor communication links.
- Control unit: holds the schedule of operations.
- Nonlinear unit: sigmoid, tanh, etc. (not always necessary).
- Neighbor communication links: data aggregation (sum or product); direct communication between neighbors enables parallel data exchange.

Benchmarks and Hardware Metrics
- Execution time: CPU and GPU measured as wall-clock time averaged over 100 runs; FPGA measured with hardware counters synthesized in the logic.
- Power measurements: Intel CPUs with the Intel RAPL library; Tesla K40 and GTX 650 Ti with Nvidia NVML (the GTX 650 Ti figure estimated from the Tesla K40 measurements); ARM A15 and Tegra K1 with a Keysight programmable power supply; Zynq with a TI UCD9240 programmable power supply.

Results
- Resource utilization; speedup over the ARM A15 for CPUs and GPUs; performance-per-Watt for CPUs and GPUs; and speedup over the ARM A15 as a function of the number of PEs and of the available bandwidth.

Conclusions
- ML algorithms are compute intensive and benefit from acceleration, but FPGAs are plagued by long development cycles.
- TABLA is a framework that generates FPGA accelerators by leveraging the commonality of stochastic gradient descent across ML algorithms.
- Average speedup of 2.9x over a vectorized Intel Xeon and 19.4x over an ARM A15; higher performance-per-Watt than the Tesla K40, 20.2x higher than the GTX 650 Ti, and 17.57x higher than the Tegra K1.
- Each learning algorithm is expressed in less than 50 lines of code.

Discussion
- How does TABLA compare against custom solutions such as the Google TPU?
- Should it be the programmer's responsibility to manually derive the gradient of the cost function for increasingly complex ML algorithms?
- Is it worth restricting TABLA to solving all optimization problems with SGD?

Thank You!