Linear Regression with Linked Data Files
Speaker: Emanuel Ben-David, Census Bureau
Abstract: Large organizations that own or have access to multiple data sources regularly rely on data integration to conduct large-scale scientific projects. Record linkage, or entity resolution, is an essential task in data integration: identifying which records in different datasets belong to the same entity. In practice, because unique identifiers are lacking, record linkage is prone to two kinds of matching error: false matches and missed matches. Statistical analysis of linked data files, even with a low rate of matching error, can then suffer from selection bias and adverse outliers. It is therefore of interest to develop statistical methods that alleviate the adverse effects of matching errors on the analysis. In this talk, I consider the regression analysis of “permuted data,” in which record linkage results in an unknown permutation of the observations of the response variable. Assuming that the matching error is small, I propose an approach for estimating the regression parameters that is statistically sound and computationally feasible.
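
Illustration: the "permuted data" setting can be made concrete with a small simulation. The sketch below is not the estimator proposed in the talk; it only shows how ordinary least squares behaves when a small fraction of responses is shuffled among records. The function name, sample size, mismatch rate, coefficients, and noise level are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_permuted_ols(n=2000, mismatch_rate=0.05, beta=(1.0, 2.0),
                          noise_sd=0.5, seed=0):
    """Fit OLS on correctly linked and on partially permuted responses.

    Hypothetical helper for illustration only; all defaults are assumed.
    """
    rng = np.random.default_rng(seed)

    # Design matrix with an intercept and one Gaussian covariate.
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ np.asarray(beta) + rng.normal(scale=noise_sd, size=n)

    # Build a sparse permutation: shuffle the responses among a small
    # random subset of records and leave every other record in place.
    k = int(round(mismatch_rate * n))
    idx = rng.choice(n, size=k, replace=False)
    perm = np.arange(n)
    perm[idx] = rng.permutation(idx)
    y_linked = y[perm]  # responses as they appear after imperfect linkage

    # Least-squares fits with the correct and the permuted responses.
    beta_correct, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_naive, *_ = np.linalg.lstsq(X, y_linked, rcond=None)
    return beta_correct, beta_naive, perm


if __name__ == "__main__":
    beta_correct, beta_naive, perm = simulate_permuted_ols()
    print("mismatched records:", int(np.sum(perm != np.arange(perm.size))))
    print("OLS with correct links: ", beta_correct)
    print("OLS with permuted links:", beta_naive)
```

Under these assumptions the slope fitted to the permuted responses tends to shrink toward zero by roughly the mismatch rate, because a mismatched response is essentially unrelated to the covariate of the record it was attached to. This is the kind of distortion that adjusted methods aim to correct.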