Preprint
Asynchronous and Distributed Data Augmentation for Massive Data Settings
ArXiv.org
Cornell University
09/18/2021
DOI: 10.48550/arXiv.2109.08969
Abstract
Data augmentation (DA) algorithms are widely used for Bayesian inference due
to their simplicity. In massive data settings, however, DA algorithms are
prohibitively slow because they pass through the full data in every iteration,
which severely restricts their use despite their advantages. To address
this problem, we develop a framework that extends any DA algorithm by exploiting
asynchronous and distributed computing. The extended DA algorithm is indexed by
a parameter $r \in (0, 1)$ and is called Asynchronous and Distributed (AD) DA
with the original DA as its parent. Any ADDA starts by dividing the full data
into $k$ smaller disjoint subsets and storing them on $k$ processes, which
could be machines or processors. Every iteration of ADDA augments only an
$r$-fraction of the $k$ data subsets with some positive probability and leaves
the remaining $(1-r)$-fraction of the augmented data unchanged. The parameter
draws are obtained using the $r$-fraction of new and $(1-r)$-fraction of old
augmented data. For many choices of $k$ and $r$, the fractional updates of ADDA
lead to a significant speed-up over the parent DA in massive data settings, and
ADDA reduces to the distributed version of its parent DA when $r=1$. We show that
the ADDA Markov chain is Harris ergodic with the desired stationary
distribution under mild conditions on the parent DA algorithm. We demonstrate
the numerical advantages of the ADDA in three representative examples
corresponding to different kinds of massive data settings encountered in
applications. In all these examples, our DA generalization is significantly
faster than its parent DA algorithm for all the choices of $k$ and $r$. We also
establish geometric ergodicity of the ADDA Markov chain for all three examples,
which in turn yields asymptotically valid standard errors for estimates of
desired posterior quantities.
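The fractional update described in the abstract can be illustrated with a minimal, serial sketch. Everything specific here is an assumption made for illustration and is not taken from the preprint: the toy model (linear regression with Student-t errors written as a normal scale mixture, unit scale, flat prior on the coefficients), the function name adda_sketch, the rule that each subset is refreshed independently with probability r, and the simulation of the k processes in a single loop rather than on separate workers.

```python
import numpy as np


def adda_sketch(X, y, k=4, r=0.5, nu=4.0, n_iter=200, seed=0):
    """Toy ADDA-style sampler (illustrative assumption, not the authors' code).

    Model: y_i | w_i ~ N(x_i' beta, 1 / w_i), w_i ~ Gamma(nu/2, rate=nu/2),
    with a flat prior on beta. The w_i play the role of the augmented data.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Divide the row indices into k disjoint subsets, one per simulated process.
    subsets = np.array_split(rng.permutation(n), k)
    beta = np.zeros(p)
    w = np.ones(n)  # augmented data: one latent weight per observation
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # Assumption: each subset is refreshed independently with probability r.
        refresh = rng.random(k) < r
        for idx, do_refresh in zip(subsets, refresh):
            if do_refresh:
                resid = y[idx] - X[idx] @ beta
                rate = (nu + resid**2) / 2.0
                # Full conditional of w_i: Gamma((nu + 1) / 2, rate).
                w[idx] = rng.gamma(shape=(nu + 1) / 2.0, scale=1.0 / rate)
            # Subsets that are not refreshed keep their old augmented data.
        # Parameter draw uses all weights, both newly drawn and old.
        precision = X.T @ (w[:, None] * X)
        mean = np.linalg.solve(precision, X.T @ (w * y))
        chol = np.linalg.cholesky(precision)  # precision = chol @ chol.T
        # mean + chol^{-T} z has covariance equal to the inverse precision.
        beta = mean + np.linalg.solve(chol.T, rng.standard_normal(p))
        draws[t] = beta
    return draws
```

Calling adda_sketch(X, y, k=4, r=0.5) on an n-by-p design matrix X and response vector y returns an array of n_iter coefficient draws. Under the refresh rule assumed here, setting r=1 refreshes every subset in every iteration, which mirrors the abstract's remark that ADDA reduces to the distributed version of its parent DA in that case.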
Details
- Title: Subtitle
- Asynchronous and Distributed Data Augmentation for Massive Data Settings
- Creators
- Jiayuan Zhou; Kshitij Khare; Sanvesh Srivastava
- Resource Type
- Preprint
- Publication Details
- ArXiv.org
- DOI
- 10.48550/arXiv.2109.08969
- ISSN
- 2331-8422
- Publisher
- Cornell University
- Language
- English
- Date posted
- 09/18/2021
- Academic Unit
- Statistics and Actuarial Science
- Record Identifier
- 9984288737702771