A Scalable System for Neural Architecture Search
Conference proceeding

Jeff Hajewski and Suely Oliveira
2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp.0053-0060
01/2020
DOI: 10.1109/CCWC47524.2020.9031181

Abstract

Building reliable systems for neural architecture search requires careful design consideration due to the high computational demands coupled with the necessity of fault tolerance. In this domain, it is not uncommon for applications to crash due to GPU memory exhaustion, which makes fault tolerance an even more important attribute of a distributed neural architecture search system. We propose an RPC-based system that is robust to node failures and provides elastic compute abilities, allowing the system to add or remove computational resources as needed. The system is demonstrated on the task of neural architecture search for image classification using the CIFAR-10 dataset. Our system achieves near-linear scaling and is robust to multiple GPU node failures, allowing the failed nodes to restart and rejoin.
Keywords: artificial intelligence; distributed deep learning; distributed system; neural architecture search; RPC
