Matching Methods for High-Dimensional Data with Applications to Text

Date
-
Event Sponsor
The Munro Lectureship Fund and The Lane Center
Location
Encina Hall West, Room 400 (GSL)
Speaker

Brandon Stewart, Assistant Professor of Sociology, Princeton University

 

Abstract

Matching is a technique for preprocessing observational data to facilitate causal inference and reduce model dependence by ensuring that treated and control units are balanced along pre-treatment covariates. We identify situations where matching with thousands of covariates is desirable, particularly when pre-treatment covariates are measured from text. However, traditional matching approaches were designed for relatively small numbers of covariates and are ineffective in high dimensions. We propose a conceptually simple solution: estimate and match on a low-dimensional summary of the covariates to improve balance in high dimensions. Under this framework, we develop Topical Inverse Regression Matching (TIRM), a method that balances a low-dimensional projection of text-derived covariates. We illustrate by estimating the effect of censorship on the writing of Chinese bloggers, the effects of perceptions of author gender on citation counts in academia, and the effect of Usama bin Laden’s death on the popularity of his writings.
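The general idea in the abstract, matching on a low-dimensional summary rather than on thousands of raw covariates, can be sketched roughly as follows. This is not TIRM: the summary here is a plain truncated SVD projection used as a stand-in, the data are synthetic word counts, and the function names (low_dim_summary, match_nearest) are hypothetical and chosen only for illustration.

```python
import numpy as np


def low_dim_summary(X, k):
    """Project the rows of X onto its top-k right singular vectors.

    Assumption: an SVD projection stands in for the model-based
    summary described in the abstract; it is not the TIRM estimator.
    """
    Xc = X - X.mean(axis=0)  # center each covariate
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # shape (n_units, k)


def match_nearest(Z, treated):
    """Pair each treated unit with its nearest control unit in Z.

    Matching is one-to-one with replacement, using squared
    Euclidean distance on the low-dimensional summary.
    """
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    # Pairwise squared distances, shape (n_treated, n_control)
    diffs = Z[t_idx, None, :] - Z[None, c_idx, :]
    dists = (diffs ** 2).sum(axis=2)
    return t_idx, c_idx[dists.argmin(axis=1)]


# Synthetic example: 200 documents, 1000 word-count covariates.
rng = np.random.default_rng(0)
n, p, k = 200, 1000, 5
X = rng.poisson(1.0, size=(n, p)).astype(float)
treated = rng.random(n) < 0.3  # boolean treatment indicator

Z = low_dim_summary(X, k)
t_idx, matched_controls = match_nearest(Z, treated)
```

Each treated unit in t_idx is paired with the control unit at the same position in matched_controls. In practice, balance on the summary would still need to be checked after matching.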

 

Biography
Brandon Stewart is an Assistant Professor of Sociology at Princeton University, where he is also affiliated with the Politics Department, the Office of Population Research, and the Center for the Digital Humanities. He develops new quantitative statistical methods for applications across computational social science. He completed his PhD in Government at Harvard in 2015, where he had the good fortune of working with the interdisciplinary group at IQSS. He also earned a master's degree in Statistics from Harvard in 2014.
 
He has worked extensively on methods for automated text analysis and, with Justin Grimmer, published an introduction to the field. Together with Molly Roberts and Dustin Tingley, he developed the Structural Topic Model, an unsupervised topic model geared towards inference in the social sciences. The accompanying software, stm, is available on CRAN and at structuraltopicmodel.com. The package also includes a full vignette demonstrating its use.
 
Stewart has recently been working on Latent Factor Regressions, which provide a general framework for modeling dependent data. The framework covers numerous data types, including grouped/multilevel, time-series cross-sectional, spatial, and network data, all with a single approach. While previous proposals in the literature can take days to estimate a single model, estimation under his framework often takes less than a second. He will release an R package implementing these new methods.