Doubly Dividing the Massive Data for Prediction Using Model Aggregation

Abstract: This article studies prediction of the conditional mean of a response given predictors using model aggregation. Massive data, characterized by both high dimension and huge sample size, usually cannot be stored on a single machine, which makes analysis and prediction challenging. We propose a distributed gridding model aggregation approach that overcomes the storage limitation of a single machine and the curse of dimensionality while enhancing prediction accuracy for massive data under a high-dimensional linear regression model. Specifically, on each local machine, which stores data of relatively moderate sample size, we develop a model aggregation approach based on splitting the predictors, for which a greedy algorithm is proposed. To obtain the optimal model weights across all local machines, we further design a distributed, communication-efficient algorithm that only requires minimizing a shifted and penalized quadratic loss function on the master machine, computing the gradient of the loss function on each local machine, and transferring it to the master. Our procedure effectively distributes the workload and dramatically reduces communication costs. Theoretically, we establish that, within a constant number of communication rounds, the proposed method matches the prediction error bound of the practically infeasible oracle method that has access to the full sample, and we express the convergence rate explicitly in terms of the local sample size and the number of communication rounds. Extensive experiments on both simulated and real-world datasets support the theory and demonstrate the encouraging performance of the proposed method.
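
The communication pattern described in the abstract (gradients computed locally, a shifted and penalized quadratic loss minimized on the master) can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a ridge penalty (lam/2)||w||^2 so that the master step has a closed form, treats the candidate-model predictions on each machine as already computed, and uses simulated data. All names (`gram_and_moment`, `local_gradient`, `distributed_weights`, `simulate`) are hypothetical.

```python
import random

def gram_and_moment(P, y):
    """Return H = P^T P / n and b = P^T y / n for one machine's data."""
    n, K = len(P), len(P[0])
    H = [[sum(P[i][j] * P[i][k] for i in range(n)) / n for k in range(K)]
         for j in range(K)]
    b = [sum(P[i][j] * y[i] for i in range(n)) / n for j in range(K)]
    return H, b

def local_gradient(H, b, w):
    """Gradient of the local loss (1/2n)||y - P w||^2 at w, i.e. H w - b."""
    K = len(w)
    return [sum(H[j][k] * w[k] for k in range(K)) - b[j] for j in range(K)]

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    K = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(K):
        p = max(range(c, K), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, K):
            f = M[r][c] / M[c][c]
            for k in range(c, K + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * K
    for r in reversed(range(K)):
        x[r] = (M[r][K] - sum(M[r][k] * x[k] for k in range(r + 1, K))) / M[r][r]
    return x

def add_ridge(H, lam):
    """Return H + lam * I (assumed ridge penalty, not taken from the paper)."""
    K = len(H)
    return [[H[j][k] + (lam if j == k else 0.0) for k in range(K)] for j in range(K)]

def distributed_weights(stats, lam, rounds):
    """stats[m] = (H_m, b_m) for machine m; stats[0] belongs to the master."""
    H1, _ = stats[0]
    K = len(H1)
    A = add_ridge(H1, lam)
    w = [0.0] * K
    for _ in range(rounds):
        # Each machine sends only a K-vector (its local gradient) to the master.
        grads = [local_gradient(H, b, w) for H, b in stats]
        g = [sum(gr[j] for gr in grads) / len(grads) for j in range(K)]
        # Master minimizes L_1(w) - <grad L_1(w_t) - g, w> + (lam/2)||w||^2,
        # whose stationarity condition is (H_1 + lam I) w = H_1 w_t - g.
        rhs = [sum(H1[j][k] * w[k] for k in range(K)) - g[j] for j in range(K)]
        w = solve(A, rhs)
    return w

def simulate(n, rng):
    """Simulated predictions of three hypothetical candidate models plus response."""
    P, y = [], []
    for _ in range(n):
        s = rng.gauss(0, 1)
        P.append([s + rng.gauss(0, 0.5), s + rng.gauss(0, 1.0), rng.gauss(0, 1)])
        y.append(s + rng.gauss(0, 0.3))
    return P, y

rng = random.Random(0)
lam = 0.1
stats = [gram_and_moment(*simulate(1000, rng)) for _ in range(4)]
w_hat = distributed_weights(stats, lam, rounds=30)

# Full-sample ("oracle") ridge weights for comparison; machines have equal size.
K, M = 3, len(stats)
H_full = [[sum(H[j][k] for H, _ in stats) / M for k in range(K)] for j in range(K)]
b_full = [sum(b[j] for _, b in stats) / M for j in range(K)]
w_full = solve(add_ridge(H_full, lam), b_full)
```

In this sketch the fixed point of the master update is the full-sample penalized solution, and each round transmits one gradient vector per machine rather than raw data. The iteration contracts when the master's local Gram matrix is close to the pooled one, which is why the local sample size matters.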