Sanghamitra Dutta, Carnegie Mellon University
Machine Learning using Unreliable Components: From Matrix Operations to Neural Networks and Stochastic Gradient Descent
Reliable computation is a key challenge in large-scale machine learning today. Unreliability in computation can manifest itself in many forms, e.g., (i) "straggling" of a few slow processing nodes, which can delay the entire computation, as in synchronous gradient descent; (ii) processor failures; and (iii) "soft-errors," which are undetected errors where nodes can produce garbage outputs. My focus is on the problem of training using unreliable nodes.
First, I will introduce the problem of training model-parallel neural networks in the presence of soft-errors. This problem was in fact the motivation for von Neumann's 1956 study, which started the field of computing using unreliable components. We propose "CodeNet", a unified, error-correction coding-based strategy that is woven into the linear algebraic operations of neural network training to provide resilience to errors in every operation during training. I will also survey some of the notable results in the emerging area of "coded computing," including my own work on matrix-vector and matrix-matrix products, that outperform classical results in fault-tolerant computing by arbitrarily large factors in expected time. Next, I will discuss the error-runtime trade-offs of various data-parallel approaches to training machine learning models in the presence of stragglers, in particular synchronous and asynchronous variants of SGD. Finally, I will discuss some open problems in this exciting and interdisciplinary area.
Parts of this work have been accepted at AISTATS 2018 and ISIT 2018.
Machine Learning Coffee seminars are weekly seminars held jointly by Aalto University and the University of Helsinki. The seminars aim to bring together people from different fields of science who share an interest in machine learning. Talks begin at 9:15 am, and porridge and coffee are served from 9:00 am.
Welcome!
Last updated on 13 Apr 2018 by Teemu Roos - Page created on 13 Apr 2018 by Teemu Roos