Georges' STAT DAY 2024


Abstracts

 

Keynote Talk

Title: How a "Community-of-the Whole" Approach Informs Our Road to the 2030 Census

Speaker: Robert L. Santos, Director of the U.S. Census Bureau

Abstract:

Although the U.S. Census Bureau’s decennial census of the population takes place over a relatively short period, preparations begin years in advance. Among the first results delivered are a data file used for federal reapportionment of congressional seats, followed by files state governments use to redraw their congressional districts and many more detailed tabulations relied on widely by communities, institutions, and every level of commerce. To ensure the success of the 2030 census, Director Santos is leading the Bureau through organizational initiatives that prioritize integrity, innovation, and community engagement. Robust analyses of censuses are deep, informative, and made public in order to ensure transparency and to solicit public comment on future plans. Continuous engagement with diverse voices invites communities not only to participate in a once-in-a-decade count but also to offer input that can strengthen every stage from early planning to publication. The road to 2030 has begun and is revealing fresh insights, new expectations, and emerging data needs. Evaluations of the last census (Census Evaluations and Experiments), research plans to support innovations in the next (2030 Census Research Project Explorer), and initial plans (2030 Census Planning) are all posted publicly on the Bureau’s website. Transparency aligns with core values of integrity, innovation, and engagement to achieve the most participatory census in history, measured not just by participation rates but by the broadest practicable engagement throughout the decade in every phase of its design.

 

Title: Estimating Latent Group Structure in Functional Semiparametric Regression Models

Speaker: Tatiyana Apanasovich, Department of Statistics, George Washington University

Abstract:

In this talk, we discuss a functional semiparametric regression model with latent group structures to accommodate nontrivial heterogeneous relationships between a functional response and functional covariates. The proposed modeling and estimation framework combines the ideas of k-means clustering, local kernel regression, functional principal component analysis, and the EM algorithm. We establish the consistency of the proposed estimators and derive their convergence rates and asymptotic distributions. Simulation studies are carried out to examine the finite-sample properties of the proposed methods. The work is motivated by the problem of modeling the macroeconomic determinants of carbon dioxide emissions.
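
As a rough sketch of the clustering ingredient only (not the full semiparametric estimation or the EM step), discretized curves can be clustered by pairing principal component scores with k-means; the grid, noise level, and use of scikit-learn below are assumptions of the illustration.

# Illustrative sketch: cluster discretized functional responses via PCA scores + k-means.
# This mimics only the initialization idea (FPCA + k-means), not the full
# semiparametric estimation or EM step described in the talk.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)                                      # common observation grid
group_means = [np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]   # two latent groups
curves = np.vstack([m + 0.3 * rng.standard_normal((40, t.size)) for m in group_means])

scores = PCA(n_components=3).fit_transform(curves)             # surrogate for FPCA scores
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(labels)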

 

Title: What can a bloom date tell us about climate change?

Speaker: Jonathan Auerbach, Department of Statistics, George Mason University

Abstract:

The law of the flowering plants states that a plant blooms after being exposed to a predetermined quantity of heat. The law is among the oldest statistical discoveries still in use today, stated by Réaumur in the eighteenth century, popularized by Quetelet in the nineteenth century, and used to reconstruct historic temperatures and study climate change in the twentieth and twenty-first centuries. But a recent body of literature has called into question whether bloom dates are in fact a reliable measure of historic temperatures.

In this talk, I will reexamine the law of the flowering plants using results from renewal theory. I will first challenge evidence suggesting that bloom dates are not a reliable measure of historic temperatures. I will then show that popular methods for temperature reconstruction likely overestimate the difference between past climates and the climates of today. Finally, I will conclude by presenting a model for reconstructing temperatures from bloom dates.
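
As a point of reference for the "predetermined quantity of heat" idea, a toy thermal-time rule can be written in a few lines: accumulate daily temperature above a base value and call the bloom date the first day the running total crosses a threshold. The base temperature, threshold, and simulated series are illustrative assumptions, not values from the talk.

# Toy thermal-time (growing degree day) rule: a plant blooms once accumulated
# heat above a base temperature reaches a fixed threshold.
import numpy as np

rng = np.random.default_rng(1)
daily_temp = 10 + 12 * np.sin(np.linspace(-np.pi / 2, np.pi / 2, 120)) + rng.normal(0, 2, 120)

base, threshold = 5.0, 400.0                      # assumed base temperature (C) and heat requirement
heat = np.cumsum(np.clip(daily_temp - base, 0, None))
bloom_day = int(np.argmax(heat >= threshold))     # first day the heat sum crosses the threshold
print(bloom_day)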

 

Title: Detection of Structural Breaks in Non-stationary Spatial Random Field 

Speaker: Pramita Bagchi, Department of Statistics, George Mason University

Abstract:

We propose a method for investigating structural breaks in a non-stationary spatial random field observed over a regular grid. We work in a frequency-domain set-up and propose a statistic based on the maximal difference between local spatial spectral densities, with the maximum taken over locations and a range of frequencies. We establish the theoretical properties of the proposed statistic and use them to construct a consistent asymptotic level-α test of the stationarity hypothesis. Further, this statistic provides a visual tool for understanding the nature of the non-stationarity present in the data. We use this visual tool, called the disparity map, along with the theoretical properties of the statistic to construct a piecewise stationary approximation of the observed random field in which the pieces are rectangular regions. An initial partition is constructed using sequential application of the proposed test for stationarity. A hierarchical clustering algorithm is then used to determine the optimal number of regions and to merge the obtained partition appropriately to produce a final partition. We present a computationally efficient implementation of our methodology. The accuracy and performance of the proposed methods are demonstrated via extensive simulations and two case studies using climate data.

 

Title: Subgroup Analyses in Clinical Trials Based on the Desirability of Outcome Ranking (DOOR)

Speaker: Weixiao Dai, Department of Biostatistics and Bioinformatics, George Washington University

Abstract:

Interpretation of clinical trial outcomes is a great challenge because interventions have effects along different dimensions; there may be both beneficial and harmful effects. Traditionally, each efficacy and safety outcome is assessed separately, by evaluating the overall effect of each endpoint independently. However, summing marginal analyses of each outcome does not effectively characterize the overall effect on patients. Thus, the desirability of outcome ranking (DOOR) was introduced to address the difficulty of benefit:risk assessment.

The desirability of outcome ranking (Evans et al., 2015; Evans and Follmann, 2016) is a paradigm that aims to resolve these challenges. It begins with the development of the DOOR, which categorizes each trial participant with respect to their overall clinical outcome, considering both efficacy and safety, as an ordinal ranking from 1 to K. The total number of DOOR ranks, K, depends on clinical decisions about how many efficacy and safety components investigators would like to include. The overall effect of an intervention can be evaluated by comparing the distributions of the DOOR between the intervention and control groups. A summary measure of treatment effect based on the DOOR is the DOOR probability (Evans and Follmann, 2016), the probability that a randomly selected patient from one treatment group has a more desirable outcome than a randomly selected patient from the other treatment group. An unbiased estimator of the DOOR probability is given by the Wilcoxon-Mann-Whitney U statistic (scaled by the product of the group sizes), and closed-form confidence interval estimates are well studied.
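
In concrete terms, the DOOR probability can be estimated from two samples of ordinal ranks as the Mann-Whitney win count (ties counted as one half) divided by the product of the group sizes; the ranks below are hypothetical.

# DOOR probability estimate: P(a random patient on treatment has a more desirable
# outcome than one on control), with ties counted as 1/2.
import numpy as np

treatment = np.array([1, 1, 2, 2, 3, 4])   # hypothetical DOOR ranks (1 = most desirable)
control   = np.array([2, 3, 3, 4, 4, 5])

# Lower rank = more desirable, so "wins" are treatment < control.
wins = (treatment[:, None] < control[None, :]).sum()
ties = (treatment[:, None] == control[None, :]).sum()
door_prob = (wins + 0.5 * ties) / (treatment.size * control.size)
print(door_prob)   # values above 0.5 favor the treatment arm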

Since DOOR-based analysis has exhibited promising behavior in multiple registrational trials (Howard-Anderson et al., 2022 - cUTI, Israel), we would like to see whether it can bring new insight into subgroup analyses. Subgroup analyses have been widely applied across disease areas because they bring rigor and caution to clinical trials. When the overall treatment effect is significant, subgroup analyses are recommended to verify that different subgroups of patients benefit from the intervention at similar levels. Subgroup analyses can also be beneficial when the overall treatment effect is not significant, since potentially significant treatment effects may exist within subgroups.

Thus, it is natural to apply DOOR analyses to subgroups. We propose subgroup identification methods that conduct subgroup analyses based on the ordinal DOOR outcome, using the DOOR probability, to address the fact that most current literature covers only continuous or dichotomized outcomes. We revise STEPP (subpopulation treatment effect pattern plot) and CART-based tree methods, estimate subgroup treatment effects with model-free and model-based methods, and examine subgroup treatment effect heterogeneity. We use a real-data example to illustrate the methodology and provide an R package and guidance on how to conduct subgroup analyses using the DOOR as an endpoint.

 

Title: Entity Extraction and its Enhancement in Natural Language Processing

Speaker: Muzhe Guo, Department of Statistics, George Washington University

Abstract:

Utilizing data from the "covid19positive" subreddit, we employed natural language processing (NLP) to automatically identify COVID-19 cases and extract their reported symptoms. First, we trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases. Then, we developed a novel QuadArm model, which incorporates Question Answering, Dual-corpus Expansion, Adaptive Rotation Clustering, and Mapping, to extract symptoms. To further enhance this framework, we propose two novel techniques that can be incorporated in the future: (1) a Bayesian Iterative Prediction algorithm, which iteratively refines likelihood and prior probabilities until prediction labels converge, thereby enhancing the accuracy and robustness of classification models; and (2) a Multiple Synonymous Questions BioBERT, which integrates question augmentation, rather than the single question used by traditional BioBERT, to elevate BioBERT’s performance on medical QA tasks.
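
A generic sketch of the chunking idea is given below; the public sentiment checkpoint merely stands in for the authors' fine-tuned COVID-19 case classifier, and the chunk sizes are illustrative assumptions.

# Generic sketch of classifying a long post with a transformer by chunking it to fit
# the model's input limit; the public checkpoint below is a stand-in for the
# authors' fine-tuned classifier, which is not reproduced here.
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def classify_long_post(text, chunk_words=200, overlap=50):
    words = text.split()
    step = chunk_words - overlap
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words) - overlap, 1), step)]
    results = clf(chunks)
    # Average the chunk-level scores for the positive class and threshold at 0.5.
    scores = [r["score"] if r["label"] == "POSITIVE" else 1.0 - r["score"] for r in results]
    return sum(scores) / len(scores) > 0.5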

 

Title: Ancestral Inference for Branching Process in Random Environments

Speaker: Xiaoran Jiang, Department of Statistics, George Mason University

Abstract:

Ancestral inference for branching processes in random environments (BPRE) is concerned with inference regarding the parameters of the ancestor distribution generating the process. In this presentation, we describe a new generalized method of moments methodology for inference using replicated BPRE data. Even though the evolution of the process depends strongly on the offspring means of the various generations, we establish that, under appropriate centering and scaling, the ancestor and offspring mean estimators decouple and converge jointly to independent normal random variables when the ratio of the number of generations to the logarithm of the number of replicates converges to zero. We also provide estimators for the limiting variance and illustrate our results using numerical experiments and data from polymerase chain reaction (PCR) experiments.

 

Title: Graphical Measures Summarizing the Inequality of Income of Two Groups

Speaker: Joshua Landon, Department of Statistics, George Washington University

Abstract:

The substantial increase in economic inequality in favor of the upper income group in the United States and many other developed and developing nations during the past 30 years has become a major concern in public policy. In this talk, modifications of the standard measures of inequality, the Lorenz curve and Gini index, are proposed that better reflect the decline in the share of income received by the poor and middle portions of the income distribution relative to the upper end. A pair of curves, based on the fractions of either the middle or lower portion of the income distribution that have the same share as the top u%, and the areas between them and the line of equality, are introduced. The proposed curves and related measures indicate that a noticeably greater change in the U.S. income distribution occurred during the 1967–2023 period than is observed in the Lorenz curve and Gini index. These curves can then be adapted and extended to provide analogous curves comparing the relative status of two groups. One calculates the proportion of the minority group, cumulated from the bottom or middle, needed to have the same total income as the top q-th fraction of the majority group (after adjusting for sample size). The areas between these curves and the line of equality are analogous to the Gini index. The methodology is used to illustrate the change in the degree of inequality between males and females, as well as between black and white males, in the United States between 2000 and 2023.
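
For reference, the standard Lorenz curve and Gini index that the proposed curves modify can be computed as follows; the income vector is synthetic.

# Standard Lorenz curve and Gini index (the baseline measures the talk modifies).
import numpy as np

income = np.sort(np.random.default_rng(2).lognormal(mean=10, sigma=0.8, size=5000))
cum_pop = np.arange(1, income.size + 1) / income.size   # cumulative population share
cum_inc = np.cumsum(income) / income.sum()              # cumulative income share (Lorenz curve)
gini = 1 - 2 * np.trapz(cum_inc, cum_pop)               # Gini = 1 - 2 * area under the Lorenz curve
print(round(gini, 3))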

 

Title: A Variational Approach for Modeling High-dimensional Spatial Generalized Linear Mixed Models

Speaker: Seiyon Ben Lee, Department of Statistics, George Mason University

Abstract:

Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields such as public health, ecology, geosciences, and the social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but SGLMMs do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (i.e., basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be prohibitive for large datasets. While variational Bayes (VB) has been extended to SGLMMs, the focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB method embeds semi-parametric approximations of the latent spatial random processes and exploits the parallel computing offered by modern high-performance computing systems. Our approaches deliver nearly identical inferential and predictive performance compared to 'gold standard' methods but achieve computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations using a standard laptop within a short timeframe.

 

Title: Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data

Speaker: Remy MacDonald, Department of Statistics, George Mason University

Abstract:

Nonstationary and non-Gaussian spatial data are prevalent across many fields (e.g., counts of animal species, disease incidences in susceptible regions, and remotely-sensed satellite imagery). Due to modern data collection methods, the size of these datasets has grown considerably. Spatial generalized linear mixed models (SGLMMs) are a flexible class of models used to model nonstationary and non-Gaussian datasets. Despite their utility, SGLMMs can be computationally prohibitive for even moderately large datasets. To circumvent this issue, past studies have embedded nested radial basis functions into the SGLMM. However, two crucial specifications (knot placement and bandwidth parameters), which directly affect model performance, are typically fixed prior to model-fitting. We propose a novel approach to model large nonstationary and non-Gaussian spatial datasets using adaptive radial basis functions. Our approach: (1) partitions the spatial domain into subregions; (2) employs reversible-jump Markov chain Monte Carlo (RJMCMC) to infer the number and location of the knots within each partition; and (3) models the latent spatial surface using partition-varying and adaptive basis functions. Through an extensive simulation study, we show that our approach provides more accurate predictions than competing methods while preserving computational efficiency. We demonstrate our approach on two environmental datasets: incidences of plant species and counts of bird species in the United States.

 

Title: Longitudinal Benefit:Risk Analysis Through the Desirability of Outcome Ranking (DOOR)

Speaker: Richard Shu, Department of Biostatistics and Bioinformatics, George Washington University

Abstract:

In clinical trials, statisticians seek to understand the treatment effects as well as the safety of different interventions. However, analyzing the two aspects separately can produce misleading results. The desirability of outcome ranking (DOOR) method addresses this issue and provides a patient-centric approach to benefit:risk evaluation. A patient’s outcome is ranked based on pre-specified clinical criteria, where the most desirable rank represents a good outcome with no side effects and the least desirable rank is often a terminal event. The Mann-Whitney U statistic is often used to compare the DOOR outcomes between two treatment arms.

We propose a longitudinal version of the DOOR that estimates and makes inference about temporal treatment effects, and we demonstrate its efficiency through simulation results. Hypothetical data are generated to resemble real-world problems. The DOOR outcome is set to be non-improving over time to reflect cardiovascular or oncology trials, while it is unrestricted in the COVID-19 scenario, i.e., patients can be cured and later infected again. A Monte Carlo approach is used to construct simultaneous confidence bands, and a weighted Mann-Whitney U statistic is used to evaluate the treatment effect over all timepoints.

 

Title: Sparse Longitudinal Omics Data Analysis using Gaussian Random Function Prior

Speaker: Ali Taheriyoun, Department of Biostatistics and Bioinformatics, George Washington University

Abstract:

In an omics dataset consisting of hundreds of taxa to millions of genes, accounting for the intra-subject correlations of each variable alongside the sparse replications increases both the variation and the computational cost. To overcome this, we treat the longitudinal observations for each subject as a sparse realization such that the response variable, given the fixed effects, follows a normal, Poisson, or negative binomial (NB) distribution, where the prior for the mean of these distributions is a zero-mean Gaussian random function, or its exponential, with an unknown kernel function. The Bayesian settings for normal and Poisson responses have been studied previously, and we introduce an efficient setting for the NB case to describe the overdispersion of the abundances. The prior settings for the hyper-parameters of the kernel and a method for kernel selection are described. We also explore combining different kernels through addition and multiplication to capture local behaviors of paths. The method is examined on gut microbiome and CD4 count data. The analysis and visualizations are available in waveome, a Python package at https://github.com/omicsEye.
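
The idea of composing kernels by addition and multiplication can be illustrated with generic Gaussian-process tools (here scikit-learn as a stand-in, not the waveome package, and with a simulated series).

# Illustration of composing kernels by addition and multiplication to capture both
# a smooth trend and a periodic local pattern; scikit-learn's GP tools are used
# here as a stand-in, not the waveome implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(3)
t = np.linspace(0, 4, 60)[:, None]                        # longitudinal time points
y = 0.5 * t.ravel() + np.sin(2 * np.pi * t.ravel()) + rng.normal(0, 0.2, t.shape[0])

kernel = (RBF(length_scale=2.0)                           # smooth trend
          + RBF(length_scale=2.0) * ExpSineSquared(periodicity=1.0)  # locally periodic part
          + WhiteKernel(noise_level=0.05))                 # observation noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)
print(gp.kernel_)                                          # fitted, composed kernel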

 

Title: Independence-Encouraging Subsampling for Nonparametric Additive Models

Speaker: Yi Zhang, Department of Statistics, George Washington University

Abstract:

Additive models, typically fitted with the backfitting algorithm, offer flexibility while avoiding the curse of dimensionality, a challenge faced by many nonparametric models. This approach, however, faces high computational cost and convergence issues with large datasets. We propose a novel method, independence-encouraging subsampling (IES), to efficiently select a subsample from big data for training additive models. Inspired by the minimax optimality of orthogonal arrays (OAs), which feature pairwise independent predictors and marginally uniform coverage of the predictor range, IES aims to replicate these OA properties in the subsample. Our asymptotic analyses demonstrate that the IES subsample not only effectively approximates an OA but also ensures the convergence of the backfitting algorithm, even with high predictor dependence in the full data. The efficiency of IES is validated through simulations and a real-data application.
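
For readers unfamiliar with backfitting, a bare-bones version of the loop that IES is designed to keep well-behaved is sketched below; the lowess smoother, the two-predictor model, and all tuning values are assumptions of the sketch, and the IES subsample selection itself is not shown.

# Minimal backfitting loop for a two-component additive model y = f1(x1) + f2(x2) + noise,
# using a lowess smoother; the IES subsampling step is not shown.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)
n = 2000
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + 0.5 * x2**2 + rng.normal(0, 0.3, n)

f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(10):                                   # backfitting iterations
    r1 = y - y.mean() - f2                            # partial residuals for component 1
    f1 = lowess(r1, x1, frac=0.3, return_sorted=False)
    f1 -= f1.mean()                                   # center each component for identifiability
    r2 = y - y.mean() - f1                            # partial residuals for component 2
    f2 = lowess(r2, x2, frac=0.3, return_sorted=False)
    f2 -= f2.mean()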

 

Posters

Title: deepBreaks: a machine learning tool for identifying and prioritizing genotype-phenotype associations

Presenter: Mahdi Baghbanzadeh, Department of Biostatistics and Bioinformatics, George Washington University

Abstract:

Sequence data, such as nucleotides or amino acids, play a crucial role in advancing our understanding of biology. However, investigating and analyzing sequencing data and genotype-phenotype associations present several challenges, including non-independent observations, noise components, nonlinearity, collinearity, and high dimensionality. To address these challenges, machine learning (ML) algorithms are well-suited as they can capture nonstructural patterns and genotype-phenotype associations. Yet, there is a lack of user-friendly ML implementations that leverage the unique features of high-volume DNA sequence data. In this context, we introduce deepBreaks, a versatile approach that identifies important positions in sequence data correlating with phenotypic traits. deepBreaks compares the performance of multiple ML algorithms and prioritizes positions based on the best-fit models. It is an open-source software with online documentation available at https://github.com/omicsEye/deepBreaks.
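
A much-simplified illustration of the underlying idea (one-hot-encode alignment columns, fit a model, rank positions by importance) is given below; it is not the deepBreaks pipeline itself, and the sequences and trait are simulated.

# Simplified illustration of prioritizing sequence positions for a phenotype:
# one-hot-encode each alignment column, fit a model, rank columns by importance.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n_seq, n_pos = 200, 30
seqs = rng.choice(list("ACGT"), size=(n_seq, n_pos))
phenotype = (seqs[:, 7] == "A").astype(int)          # position 7 drives the trait (toy truth)

enc = OneHotEncoder(categories=[list("ACGT")] * n_pos, sparse_output=False)
X = enc.fit_transform(seqs)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, phenotype)

# Sum one-hot feature importances back to their originating positions.
position_importance = model.feature_importances_.reshape(n_pos, -1).sum(axis=1)
print(np.argsort(position_importance)[::-1][:5])     # top-ranked positions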

 

Title: A General Framework for Inference After Record Linkage

Presenter: Priyanjali Bukke, Department of Statistics, George Mason University

 

Title: Applying Multinomial Fisher-Rao Distances to Microbiome Abundance    

Presenter: Clark Gaylord, Department of Biostatistics and Bioinformatics, George Washington University

Abstract:

Viewing the abundance observations of microbiome samples as multinomial observations, a natural distance metric on the Riemannian manifold of density functions is the Fisher-Rao metric. Unlike the Bray-Curtis dissimilarity, the Fisher information yields a true distance. We apply this metric to microbiome samples of mice with Loeys-Dietz Syndrome (LDS) versus controls, using multidimensional scaling to extract the most informative subspace of the N×N distance matrix and constructing a statistical test for differences in microbiome abundance between cohorts. This technique can be applied to abundance at any taxonomic level, or to any other application where the base observation is viewed as a realization of a multinomial random vector.
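
Concretely, for two multinomial probability vectors p and q the Fisher-Rao geodesic distance is 2·arccos(Σ √(p_i q_i)); a minimal implementation with made-up counts:

# Fisher-Rao distance between two multinomial composition vectors:
# d(p, q) = 2 * arccos( sum_i sqrt(p_i * q_i) ), a true metric on the simplex.
import numpy as np

def fisher_rao(counts_a, counts_b):
    p = np.asarray(counts_a, float); p /= p.sum()
    q = np.asarray(counts_b, float); q /= q.sum()
    bc = np.sqrt(p * q).sum()                      # Bhattacharyya coefficient
    return 2 * np.arccos(np.clip(bc, 0.0, 1.0))

# Hypothetical taxon counts for two microbiome samples.
print(fisher_rao([40, 30, 20, 10], [25, 25, 25, 25]))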

 

Title: Smoothed Quantile Regression for Spatial Data      

Presenter: Jilei Lin, Department of Statistics, George Washington University

Abstract:

In this paper, we develop computationally efficient methods for making inference on the quantile spatial partially linear varying coefficient model (QSM) over complex domains by convolution smoothing of the check loss. In spatial data analysis, the QSM is a useful and flexible model for characterizing the complex conditional regression relation between covariates and the response at different quantile levels, where the effects are allowed to vary across locations or to be fixed. We propose a convolution-smoothed quantile partially linear model estimator using bivariate smoothing over triangulations (SQBiT). Under some regularity conditions, the proposed SQBiT estimator attains an optimal convergence rate under the L2 norm. A central limit theorem is established for the constant-effect estimator, from which we construct an asymptotic 1 − α confidence interval. For small samples, we propose an interval estimator based on the wild bootstrap. Through simulation studies, we demonstrate that the SQBiT estimator achieves substantial computational and estimation improvements over its unsmoothed counterpart. Additionally, we show that the intervals based on the wild bootstrap and the normal approximation achieve the nominal confidence level in small and large samples, respectively. Last, we illustrate the proposed method on a mortality dataset.
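
To make "convolution smoothing of the check loss" concrete, the Gaussian-kernel version has a closed form, shown below; the kernel and bandwidth choices used by SQBiT may differ.

# Convolution-smoothed check loss with a Gaussian kernel (closed form):
#   l_h(u) = h * phi(u/h) + u * (Phi(u/h) - (1 - tau)),
# which recovers the usual check loss rho_tau(u) = u * (tau - 1{u < 0}) as h -> 0.
import numpy as np
from scipy.stats import norm

def smoothed_check_loss(u, tau=0.5, h=0.5):
    u = np.asarray(u, float)
    return h * norm.pdf(u / h) + u * (norm.cdf(u / h) - (1 - tau))

def check_loss(u, tau=0.5):
    u = np.asarray(u, float)
    return u * (tau - (u < 0))

u = np.linspace(-3, 3, 7)
print(smoothed_check_loss(u, tau=0.25, h=0.1))   # close to the unsmoothed loss for small h
print(check_loss(u, tau=0.25))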

 

Title: A Nonparametric Bayesian Model of Citizen Science Data for Monitoring Environments Stressed by Climate Change

Presenter: Ruishan Lin, Department of Statistics, George Mason University

Abstract:

We propose a new method to adjust for the bias that occurs when citizen scientists monitor a fixed location and report whether an event of interest has occurred or not, such as whether a plant has bloomed. The bias occurs because the monitors do not record the day each plant first blooms at the location, but rather whether a certain plant has already bloomed when they arrive on site. Adjustment is important because differences in monitoring patterns can make local environments appear more or less anomalous than they actually are, and the bias may persist when the data are aggregated across space or time. To correct for this bias, we propose a nonparametric Bayesian model that uses monotonic splines to estimate the distribution of bloom dates at different sites. We then use our model to determine whether the lilac monitored by citizen scientists in the northeast United States bloomed anomalously early or late, preliminary evidence of environmental stress caused by climate change. Our analysis suggests that failing to correct for monitoring bias would underestimate the peak bloom date by 32 days on average. In addition, after adjusting for monitoring bias, several locations have anomalously early bloom dates that did not appear anomalous before adjustment.

 

Title: Clustering of High Dimensional Observations

Presenter: Yong Wang, Department of Statistics, George Washington University

Abstract:

We present a novel clustering method for high-dimensional, low sample size (HDLSS) data. The method is distance-based, takes advantage of the concentration phenomenon, and uses the limiting values of the dissimilarity indices to construct clusters. We describe an algorithm that orders each row of the dissimilarity matrix to estimate the change points, which define cluster boundaries. Based on the clusters found in each row, we construct an agreement matrix of the Rand indices of the row clusters. The minimum of the row sums of the agreement matrix provides us with the best clusters. We prove that the new method achieves perfect clustering as the number of features diverges for fixed sample size. Several examples are presented to illustrate the proposed clustering method. We compare the new method with eight other clustering techniques, including high-dimensional k-means, minimal spanning tree, and hierarchical scan, in the HDLSS setting.

 

Title: Joint Modeling of Interval Censored Adenoma Data and Informative Screening to Predict Risk of Advanced Adenoma

Presenter: Yipeng Wei, Department of Statistics, George Washington University

Abstract:

Recurrent interval-censored (panel count) data are one of the common forms of screening data in epidemiological studies. In the context of colorectal cancer screening, our work focuses on predicting the probability of advanced adenoma and assessing risk factors for colorectal cancer. The approach involves non-stationary Poisson processes for adenoma and informative screening events, with a semi-parametric Cox model, correlated through a latent frailty variable. The study employs the non-parametric Turnbull algorithm for estimating the cumulative distribution function of the intensity function and utilizes the borrow-strength method for estimating the subject-specific latent frailty. A marginal prediction model and a frailty prediction model are proposed, depending on the availability of the patient’s screening history.

 

Title: A New Family of Covariate-Adjusted Response-Adaptive Randomization Procedures for Precision Medicine

Presenter: Jiaqian Yu, Department of Statistics, George Washington University

Abstract:

In most clinical trials, patients accrue sequentially and need to be assigned to different treatment groups. Previous studies of randomization procedures include complete randomization, restricted randomization, response-adaptive randomization (RAR), covariate-adaptive randomization (CAR), and covariate-adjusted response-adaptive randomization (CARA). With the development of precision medicine, information about biomarkers is usually available and should be incorporated into the randomization procedure. In statistical analysis, biomarkers are treated mathematically as covariates, and we classify them into predictive and prognostic covariates according to their roles. Under this setting, we identify drawbacks of the previous designs and propose a new family of CARA designs for precision medicine. The new family of CARA designs integrates covariate balance and target allocation: it not only assigns more patients to better treatments based on the predictive covariates but also balances the prognostic covariates. More specifically, a new Weighted Balance Ratio (WBR) for prognostic covariates is defined within each stratum of the predictive covariates and is incorporated into the Doubly-Adaptive Biased Coin Design (DBCD) method with Hu and Zhang’s allocation function. In this poster, we demonstrate the details of the new family of CARA designs with theoretical and simulation results.
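
As background on the DBCD machinery the design builds on, a two-arm sketch with a fixed target allocation and Hu and Zhang's allocation function is given below; in the proposed CARA design the target and the WBR are updated from accruing responses and covariates, which is not shown.

# Two-arm DBCD sketch with Hu & Zhang's allocation function and a fixed target
# allocation rho; in the proposed CARA design, rho and the WBR are updated from
# responses and covariates as patients accrue (not shown here).
import numpy as np
rng = np.random.default_rng(6)

def hu_zhang_prob(x1, rho, gamma=2.0):
    # Probability of assigning the next patient to arm 1, given the current
    # proportion x1 already on arm 1 and the target proportion rho.
    a = rho * (rho / x1) ** gamma
    b = (1 - rho) * ((1 - rho) / (1 - x1)) ** gamma
    return a / (a + b)

def dbcd_two_arm(n=400, rho=0.6, gamma=2.0):
    n1, n2 = 1, 1                                   # burn-in: one patient per arm
    for _ in range(n - 2):
        p1 = hu_zhang_prob(n1 / (n1 + n2), rho, gamma)
        n1, n2 = (n1 + 1, n2) if rng.random() < p1 else (n1, n2 + 1)
    return n1 / (n1 + n2)

print(dbcd_two_arm())                               # close to the 0.6 target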

 

Title: Benefit-Risk Evaluation for Diagnostics: A Framework (BED-FRAME) and Average Weighted Accuracy (AWA): Online Analysis Tool

Presenter: Shanshan Zhang, Department of Biostatistics and Bioinformatics, George Washington University