Supervised Clustering of Features Via Implicit Network
Speaker: Anand Vidyashankar, George Mason University
Abstract: High-dimensional regression models are used in a variety of applications, and identifying groups of predictors that are significantly associated with the response variable facilitates downstream data analysis and decision making. This presentation describes a new data analysis method that uses the high-dimensional regression model to construct an implicit network on the predictor space. Using a population model for clusters, defined via network-wide metrics, we describe a new supervised clustering algorithm that recovers the true clusters with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. We also describe a novel model-assisted bootstrap procedure for assessing uncertainty in the estimates of network-wide metrics; the procedure substantially reduces computational complexity. The proposed methods take into account several challenges that arise in high-dimensional data sets, such as (i) a large number of predictors, (ii) an unknown true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated using numerical experiments, data from sports analytics, and a study of breast cancer. (Joint work with Brandon Park and Tucker McElroy.)