
Yusuke Narita Publications

Discussion Paper
Abstract

Algorithms make a growing portion of policy and business decisions. We develop a treatment-effect estimator that uses algorithmic decisions as instruments, for a class of stochastic and deterministic algorithms. Our estimator is consistent and asymptotically normal for well-defined causal effects. A special case of our setup is multidimensional regression discontinuity designs with complex boundaries. We apply our estimator to evaluate the Coronavirus Aid, Relief, and Economic Security Act, which allocated many billions of dollars' worth of relief funding to hospitals via an algorithmic rule. The funding is shown to have little effect on COVID-19-related hospital activities. Naive estimates exhibit selection bias.
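
As a rough illustration of the idea, not the paper's implementation: when the decision rule is a known function of observed inputs, each unit's probability of treatment can be approximated by simulating the algorithm on locally perturbed inputs, and units whose approximate propensity lies strictly between 0 and 1 identify causal effects. All names and the toy one-dimensional cutoff rule below are assumptions for illustration.

```python
# A minimal sketch, assuming a known decision rule `algorithm` mapping an
# input vector x to a binary treatment. The propensity is approximated as
# the share of perturbed inputs for which the rule assigns treatment.
import numpy as np

def approximate_propensity(algorithm, x, bandwidth=0.1, n_sim=1000, seed=0):
    """Simulation-based approximate assignment probability near x."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-bandwidth, bandwidth, size=(n_sim, x.shape[0]))
    return np.mean([algorithm(x + e) for e in noise])

# Illustrative deterministic funding rule: treat units above a cutoff in a
# scalar index (a one-dimensional regression discontinuity).
cutoff = 0.5
algorithm = lambda x: float(x[0] >= cutoff)

x_near = np.array([0.52])  # near the boundary: propensity strictly in (0, 1)
x_far = np.array([0.95])   # far from it: propensity is essentially 1
print(approximate_propensity(algorithm, x_near))  # approx. 0.6
print(approximate_propensity(algorithm, x_far))   # approx. 1.0
```

Units like `x_far`, whose decision never changes under perturbation, contribute no identifying variation; the boundary region plays the role of the discontinuity in a regression discontinuity design.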

Discussion Paper
Abstract

We obtain a necessary and sufficient condition under which random-coefficient discrete choice models, such as mixed-logit models, are rich enough to approximate any nonparametric random utility model arbitrarily well across choice sets. The condition turns out to be the affine independence of the set of characteristic vectors. When the condition fails, so that some random utility models cannot be closely approximated, we identify preferences and substitution patterns that are challenging to approximate accurately. We also propose algorithms to quantify the magnitude of the approximation errors.
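
A minimal sketch of how the affine-independence condition can be checked in practice, assuming the characteristic vectors are stored as rows of a NumPy array: a set of vectors is affinely independent exactly when the differences from any one of them to the others are linearly independent.

```python
import numpy as np

def affinely_independent(vectors: np.ndarray) -> bool:
    """vectors: (k, d) array of k characteristic vectors in R^d."""
    diffs = vectors[1:] - vectors[0]  # (k-1, d) matrix of differences
    return np.linalg.matrix_rank(diffs) == len(vectors) - 1

# Three collinear points in R^2 are affinely dependent; a triangle is not.
print(affinely_independent(np.array([[0, 0], [1, 1], [2, 2]])))  # False
print(affinely_independent(np.array([[0, 0], [1, 0], [0, 1]])))  # True
```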

Discussion Paper
Abstract

What happens if selective colleges change their admission policies? We study this question by analyzing the world’s first implementation of nationally centralized meritocratic admissions in the early twentieth century. We find a persistent meritocracy-equity tradeoff. Compared to the decentralized system, the centralized system admitted more high-achievers and produced more occupational elites (such as top income earners) decades later in the labor market. This gain came at a distributional cost, however. Meritocratic centralization also increased the number of urban-born elites relative to rural-born ones, undermining equal access to higher education and career advancement.

Biometrics
Abstract

Dynamic treatment regimes (DTRs) are sequences of decision rules that recommend treatments based on patients' time-varying clinical conditions. The sequential, multiple assignment, randomized trial (SMART) is an experimental design that can provide high-quality evidence for constructing optimal DTRs. In a conventional SMART, participants are randomized to the available treatments at multiple stages with balanced randomization probabilities. Despite its relative simplicity of implementation and desirable performance in comparing embedded DTRs, the conventional SMART faces unavoidable ethical issues: it assigns many participants to the empirically inferior treatment or to a treatment they dislike, which can slow recruitment and raise attrition rates, ultimately weakening the internal and external validity of the trial results. In this context, we propose a SMART under the Experiment-as-Market framework (SMART-EXAM), a novel SMART design that holds the potential to improve participants' welfare by incorporating their preferences and predicted treatment effects into the randomization procedure. We describe the steps of conducting a SMART-EXAM and evaluate its performance relative to the conventional SMART. The results indicate that SMART-EXAM can improve the welfare of the participants enrolled in the trial while also retaining a desirable ability to construct an optimal DTR when the experimental parameters are suitably specified. Finally, we illustrate the practical potential of the SMART-EXAM design using data from a SMART for children with attention-deficit/hyperactivity disorder.
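
The following is a purely illustrative sketch of the welfare idea, not the paper's Experiment-as-Market mechanism: per-participant randomization probabilities are tilted toward preferred and predicted-beneficial treatments while staying bounded away from 0 and 1, so that the embedded DTRs remain estimable. The weighting scheme and the `alpha` and `floor` parameters are all assumptions.

```python
import numpy as np

def tilted_probs(predicted_effects, preferences, alpha=0.5, floor=0.1):
    """Softmax over a convex combination of predicted effects and stated
    preferences, clipped and renormalized to respect a minimum assignment
    probability (so no arm becomes unreachable)."""
    scores = alpha * np.asarray(predicted_effects) + (1 - alpha) * np.asarray(preferences)
    p = np.exp(scores) / np.exp(scores).sum()
    p = np.clip(p, floor, 1 - floor)
    return p / p.sum()

# Two-arm stage: the participant slightly prefers arm 1, which also has a
# higher predicted effect, so arm 1 is favored but arm 0 stays reachable.
print(tilted_probs(predicted_effects=[0.0, 1.0], preferences=[0.2, 0.8]))
```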

Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Abstract

Ranking interfaces are everywhere in online platforms, so there is ever-growing interest in their off-policy evaluation (OPE), which aims to accurately evaluate the performance of ranking policies using logged data. The de facto standard approach to OPE is Inverse Propensity Scoring (IPS), which provides an unbiased and consistent value estimate. However, IPS becomes extremely inaccurate in the ranking setup because of its high variance under large action spaces. To deal with this problem, previous studies assume either independent or cascade user behavior, resulting in ranking versions of IPS. While these estimators are somewhat effective in reducing the variance, all existing estimators apply a single, universal behavior assumption to every user, causing excessive bias and variance. This work therefore explores a far more general formulation in which user behavior is diverse and can vary with the user context. We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior. Moreover, AIPS achieves the minimum variance among all unbiased IPS-based estimators. We further develop a procedure to identify the appropriate user behavior model in a data-driven fashion so as to minimize the mean squared error (MSE) of AIPS. Extensive experiments demonstrate that the empirical accuracy improvement can be significant, enabling effective OPE of ranking systems even under diverse user behavior.
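
To make the contrast concrete, here is a minimal sketch, with illustrative names, of two of the weightings the abstract contrasts for a logged ranking: standard IPS weights each episode by the propensity of the entire ranking, while the "independent" user-behavior assumption factorizes the weight position by position, trading bias for much lower variance.

```python
import numpy as np

def ips_ranking(rewards, pi_e_full, pi_0_full):
    """Standard IPS: weight by the propensity of the whole ranking.
    Unbiased under any user behavior, but weights explode as the number
    of candidate rankings grows."""
    return np.mean(rewards * pi_e_full / pi_0_full)

def ips_independent(rewards_pos, pi_e_pos, pi_0_pos):
    """'Independent' IPS: (n, K) arrays of per-position rewards and marginal
    propensities; assumes users judge each position in isolation, which
    slashes variance but biases the estimate when behavior is not independent."""
    return np.mean(np.sum(rewards_pos * pi_e_pos / pi_0_pos, axis=1))
```

AIPS, per the abstract, adaptively chooses among such behavior models for each user context rather than imposing one of them universally.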

Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full-support and deficient-support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon-targeting policies run by a major online platform and show how to improve on the existing policy.
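
One illustrative ingredient, not the paper's full estimator: a deterministic logging policy gives zero propensity to unlogged actions, so vanilla IPS is undefined. Once each logged action is assigned a positive approximate propensity, for example by smoothing the deterministic rule as in the first sketch above, a counterfactual policy's value can be predicted by inverse propensity scoring restricted to the supported observations.

```python
import numpy as np

def ips_deficient_support(rewards, pi_e_logged, ps_logged):
    """Average r_i * pi_e(a_i|x_i) / ps(a_i|x_i) over observations whose
    (approximate) propensity ps is positive; under deficient support the
    unsupported region carries no identifying information and is excluded."""
    keep = ps_logged > 0
    return np.mean(rewards[keep] * pi_e_logged[keep] / ps_logged[keep])
```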

Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, no single estimator dominates the others, because an estimator's accuracy can vary greatly with the OPE task at hand, such as the evaluation policy, the number of actions, and the noise level. The data-driven estimator selection problem is thus becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable estimator selection that is adaptive to a given OPE task by appropriately subsampling the available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves estimator selection compared to a non-adaptive heuristic. The complete version with a technical appendix is available on arXiv: http://arxiv.org/abs/2211.13904.
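
A minimal sketch of the selection recipe, with illustrative helper names (`make_pseudo_policy` and the candidate estimators are assumptions, not the paper's implementation): subsample the logs to create evaluation tasks whose ground-truth policy value is known by construction, score each candidate OPE estimator on those tasks, and keep the one with the smallest error.

```python
import numpy as np

def select_estimator(logs, estimators, make_pseudo_policy, n_trials=20, seed=0):
    """logs: dict of equal-length arrays (reward, action, context, ...).
    estimators: dict name -> callable(logs, policy) returning a value estimate.
    make_pseudo_policy: callable(logs) -> (policy, true_value), where the
    true value is computable on the subsample by construction."""
    rng = np.random.default_rng(seed)
    errors = {name: [] for name in estimators}
    n = len(logs["reward"])
    for _ in range(n_trials):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample the logs
        sub = {k: v[idx] for k, v in logs.items()}
        pi_pseudo, true_value = make_pseudo_policy(sub)
        for name, est in estimators.items():
            errors[name].append((est(sub, pi_pseudo) - true_value) ** 2)
    return min(errors, key=lambda name: np.mean(errors[name]))
```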

Discussion Paper
Abstract

We obtain a necessary and sufficient condition under which random-coefficient discrete choice models, such as mixed logit models, are rich enough to approximate any nonparametric random utility model across choice sets. The condition turns out to be very simple and tractable. When the condition is not satisfied, so that there exists a random utility model that cannot be approximated by any random-coefficient discrete choice model, we provide algorithms to measure the approximation errors. Applying our theoretical results and algorithms to real data, we find that the approximation errors can be large in practice.
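
As an illustration only of what such an error-measuring algorithm might compute (not the paper's method): assume the model-achievable choice probabilities form the convex hull of finitely many candidate vectors; the approximation error for a target model is then its distance to that hull, a small constrained optimization problem.

```python
import numpy as np
from scipy.optimize import minimize

def distance_to_hull(target, candidates):
    """Min over weights w >= 0 with sum(w) = 1 of ||candidates.T @ w - target||,
    i.e., the Euclidean distance from `target` to the convex hull of the rows
    of `candidates`."""
    k = len(candidates)
    obj = lambda w: np.sum((candidates.T @ w - target) ** 2)
    cons = {"type": "eq", "fun": lambda w: w.sum() - 1}
    res = minimize(obj, np.full(k, 1 / k), bounds=[(0, 1)] * k, constraints=cons)
    return np.sqrt(res.fun)

# Target inside the hull -> error near 0; outside -> strictly positive error.
cands = np.array([[1.0, 0.0], [0.0, 1.0]])
print(distance_to_hull(np.array([0.5, 0.5]), cands))  # approx. 0.0
print(distance_to_hull(np.array([0.9, 0.9]), cands))  # approx. 0.57
```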

Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)
Abstract

Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of enabling realistic and reproducible OPE research, we present Open Bandit Dataset, a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN. Our dataset is unique in that it contains a set of multiple logged bandit datasets collected by running different policies on the same platform. This enables experimental comparisons of different OPE estimators for the first time. We also develop Python software called Open Bandit Pipeline to streamline and standardize the implementation of batch bandit algorithms and OPE. Our open data and software will contribute to fair and transparent OPE research and help the community identify fruitful research directions. We provide extensive benchmark experiments of existing OPE estimators using our dataset and software. The results open up essential challenges and new avenues for future OPE research.
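
A short usage sketch based on Open Bandit Pipeline's documented quickstart (https://github.com/st-tech/zr-obp); exact names and signatures may differ across versions of the obp package.

```python
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# Logged bandit feedback collected by the Random policy on the "all" campaign.
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# Counterfactual policy: Bernoulli Thompson Sampling replayed on the logs.
policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prompt=True,  # use the production prior from ZOZOTOWN
    campaign="all",
)
action_dist = policy.compute_batch_action_dist(n_rounds=bandit_feedback["n_rounds"])

# Estimate the counterfactual policy's value with inverse propensity weighting.
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
print(ope.estimate_policy_values(action_dist=action_dist))
```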

Proceedings of the ACM Conference on Recommender Systems
Abstract

Efficient methods to evaluate new algorithms are critical for improving interactive bandit and reinforcement learning systems such as recommendation systems. A/B tests are reliable, but they are costly in time and money and entail a risk of failure. In this paper, we develop an alternative method that predicts the performance of algorithms given historical data that may have been generated by a different algorithm. Our estimator has the property that its prediction converges in probability to the true performance of a counterfactual algorithm at a rate of √N as the sample size N increases. We also show a correct way to estimate the variance of our prediction, allowing the analyst to quantify the uncertainty in the prediction. These properties hold even when the analyst does not know which among a large number of potentially important state variables are actually important. We validate our method with a simulation experiment on reinforcement learning. Finally, we apply it to improve advertisement design at a major advertising company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
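
A minimal sketch, with illustrative names rather than the paper's estimator, of the two outputs the abstract highlights: a point prediction of a counterfactual algorithm's performance and a plug-in standard error that quantifies its uncertainty.

```python
import numpy as np

def ope_with_se(rewards, pi_e_logged, pi_0_logged):
    """IPS-style point estimate plus a plug-in standard error; the point
    estimate converges to the true value at the sqrt(N) rate as N grows."""
    w = pi_e_logged / pi_0_logged           # importance weights on logged actions
    value = np.mean(w * rewards)            # point prediction of the new policy
    se = np.std(w * rewards, ddof=1) / np.sqrt(len(rewards))  # uncertainty
    return value, se
```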

Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Abstract

Off-policy evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Since many OPE estimators have been proposed, and some of them have hyperparameters to be tuned, there is an emerging challenge for practitioners to select and tune OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult, because the current experimental procedure evaluates and compares estimators' accuracy on only a narrow set of hyperparameters and evaluation policies; we therefore cannot know which estimator is safe and reliable to use in general practice. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' sensitivity to the choice of hyperparameters and to possible changes in evaluation policies in an interpretable manner. We also build open-source Python software, pyIEOE, to streamline evaluation with the IEOE protocol. With this software, researchers can use IEOE to compare different OPE estimators in their research, and practitioners can select an appropriate estimator for a given practical situation. Using the IEOE procedure, we then perform an extensive re-evaluation of a wide variety of existing estimators on public datasets. We show that, surprisingly, simple estimators with fewer hyperparameters are more reliable than more advanced estimators, because the advanced estimators need environment-specific hyperparameter tuning to perform well. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.
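
A minimal sketch of the IEOE idea with illustrative names (this is not pyIEOE's actual API): score each estimator across randomly drawn hyperparameters and evaluation policies, then compare the whole error distribution rather than a single best-case number.

```python
import numpy as np

def ieoe_error_distribution(estimator, draw_config, logs, true_value,
                            n_draws=100, seed=0):
    """estimator: callable(logs, eval_policy, **hyperparams) -> value estimate.
    draw_config: callable(rng) -> (hyperparams, eval_policy), sampling one
    experimental condition. Returns sorted squared errors, i.e., an empirical
    CDF of the estimator's error across conditions."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_draws):
        hyperparams, eval_policy = draw_config(rng)
        estimate = estimator(logs, eval_policy, **hyperparams)
        errors.append((estimate - true_value) ** 2)
    return np.sort(np.array(errors))

# A "reliable" estimator, in this protocol, is one whose entire error
# distribution (not just its minimum) stays small across conditions.
```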

Management Science
Abstract

Many centralized school admissions systems use lotteries to ration limited seats at oversubscribed schools. The resulting random assignment is used by empirical researchers to identify the effects of schools on outcomes such as test scores. I first find that the two most popular empirical research designs may not successfully extract a random assignment of applicants to schools. When are the research designs able to overcome this problem? I show the following main results for a class of data-generating mechanisms containing those used in practice: the first-choice research design extracts a random assignment under a mechanism if the mechanism is strategy-proof for schools. In contrast, the other popular design, the qualification instrument research design, does not necessarily extract a random assignment under any mechanism. The former research design is therefore more compelling than the latter. Many applications of the two research designs need some implicit assumption, such as approximately random assignment in large samples, to justify their empirical strategy.
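
A minimal simulated sketch of the first-choice research design, with all names and numbers assumed for illustration: among applicants who rank a given school first, compare outcomes of lottery winners and losers. This comparison recovers the offer effect exactly when assignment is truly random within first-choice strata, which is the property the paper shows holds under strategy-proofness for schools.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
first_choice = rng.integers(0, 3, size=n)   # school each applicant ranks first
offer = rng.random(n) < 0.5                 # lottery offer at that school
outcome = 0.2 * offer + rng.normal(size=n)  # true offer effect: 0.2

# Within each first-choice stratum, winners minus losers estimates the effect.
for s in range(3):
    mask = first_choice == s
    effect = outcome[mask & offer].mean() - outcome[mask & ~offer].mean()
    print(f"school {s}: estimated offer effect = {effect:.3f}")  # approx. 0.2
```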

Discussion Paper
Abstract

Democracy is widely believed to have contributed to economic growth and public health in the 20th and earlier centuries. We find that this conventional wisdom is reversed in this century: democracy has had persistent negative impacts on GDP growth during 2001-2020. This finding emerges from five different instrumental-variable strategies. Our analysis suggests that democracies cause slower growth through less investment and less trade. For 2020, democracy is also found to have caused more deaths from Covid-19.