Often, I will read a theory paper, internalize its insights, and later find myself applying them to examples quite different from the ones the original paper considered. It’s common for theory papers to have leading examples that contextualize the result, but the wonderful thing about good theory is that these models usually have far more flexibility than a single example. Furthermore, studying different scenarios where a model could apply might raise important questions about which assumptions need to be relaxed in order to apply it to that scenario—as will be true for the paper discussed in this post. That could motivate an extension of the model, in a positive feedback loop that I think characterizes the best theory research: the modelling creates real-world implications that can be studied, and that study raises questions that drive the development of a new model.
This exercise—interpreting theoretical models in contexts outside their main application—occupies an ambiguous niche in the academic categorization of work. It is genuinely eye-opening to see how the basic force of a model can operate in many different contexts: it gives a good sense for whether the model is valuable or not. On the other hand, there is no new model being formed, or a substantively new insight being generated: this is not a contribution to the theory by any meaningful definition, so it doesn’t fit well as a paper. That makes a humble blog like this one a good outlet for this exercise. The next few posts will be examples of nice applications that I wanted to share.
The Paper: A/B Testing with Fat Tails
I first encountered A/B Testing with Fat Tails by Azevedo et al. (2019) when it was presented in a guest lecture for my information economics seminar. The context is simple: online platforms increasingly decide which policies to implement through “A/B tests”, a fancy name for simple randomized experiments on users of the platform. The user base is split so that half of users receive the policy and the other half do not, and the policy is then judged on whether it significantly improved the platform’s metrics over the baseline. For example, in 2008, the Obama campaign A/B tested different buttons to sign people up for receiving campaign emails: they found that a “Learn More” button boosted signups by 18% compared to the default “Sign Up” button that the campaign had been using. Combine prominent success stories like this one with the dizzying scope for policies that can be A/B tested on platforms, and it’s no wonder that A/B tests have become an integral tool for platforms to test their innovations.
Experimental resources (users to see the changes) are finite. For a genuine treatment/control split, a user can be a subject of only one A/B test at a time, so there is a tradeoff between having more users in each A/B test (and thus higher statistical power) and conducting more A/B tests (and thus screening more potential innovations). The natural question, then, is: what is the optimal experimentation policy for conducting A/B tests? How should platforms negotiate this tradeoff posed by limited capacity to test?
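To put rough numbers on that tradeoff, here is a back-of-the-envelope power calculation. Everything in it is an illustrative assumption (the 100,000-user pool, unit outcome noise, a two-sided z-test at 5% size and 80% power): the point is only that the minimum detectable effect grows with the square root of the number of experiments the pool is split across.

```python
import math

def min_detectable_effect(n_total, n_experiments, sigma=1.0,
                          z_alpha=1.96, z_beta=0.84):
    """Smallest true effect detectable at 5% size (two-sided) and 80% power
    when a pool of n_total users is split evenly across n_experiments,
    each with equal treatment and control arms."""
    n_per_arm = n_total / n_experiments / 2
    se = sigma * math.sqrt(2.0 / n_per_arm)   # SE of the difference in means
    return (z_alpha + z_beta) * se

# Splitting the same pool across 100 experiments instead of 1
# makes the smallest detectable effect 10x larger (sqrt(100)).
for k in (1, 10, 100):
    print(k, round(min_detectable_effect(100_000, k), 4))
```

So running ten times as many experiments costs roughly a factor of √10 in effect-size resolution per experiment, which is exactly the tension the paper studies.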
Optimal experimentation is a well-trodden literature, but the authors’ main contribution is to allow the distribution of innovation quality to have fat tails, so that a small mass of innovations can have significantly higher quality than the average innovation. Eduardo Azevedo told us anecdotes from talking to Microsoft engineers who A/B test features on Bing: most of the features they tested had minor impact, but a small number generated such sharp revenue increases that they were initially mistaken for bugs in the revenue numbers! That motivated the authors to treat fat-tailed distributions as an important consideration for optimal experimentation.
The paper’s findings are quite intuitive: when the distribution of innovation quality is thin-tailed, it is optimal to conduct a small number of experiments with a large number of subjects for each. In contrast, when the distribution of innovation quality is fat-tailed, the optimal policy conducts many experiments with a smaller subject pool for each. This result is neat, and to me, its intuitive quality suggests that it reflects a deeper fact that goes beyond the leading example of experimentation in online platforms.
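To see the mechanism at work, here is a minimal Monte Carlo sketch—not the paper’s actual model, and every parameter value below is a made-up assumption. A fixed pool of subjects is split across k experiments; each innovation’s true quality is drawn from either a thin-tailed normal distribution or a fat-tailed Student-t distribution (both centered so the typical idea is slightly bad); an innovation is implemented when its noisy estimate clears a one-sided 5% significance cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_gain(draw_quality, n_total, n_experiments, noise_sd=1.0, n_sims=1000):
    """Average total quality of implemented innovations when a pool of
    n_total subjects is split evenly across n_experiments, and an
    innovation ships if its noisy estimate clears a one-sided 5%
    significance cutoff (1.64 standard errors)."""
    se = noise_sd / np.sqrt(n_total / n_experiments)   # per-experiment standard error
    total = 0.0
    for _ in range(n_sims):
        q = draw_quality(n_experiments)                 # true qualities
        est = q + rng.normal(0.0, se, n_experiments)    # noisy A/B estimates
        total += q[est > 1.64 * se].sum()               # ship the significant winners
    return total / n_sims

# Illustrative quality distributions: both mostly yield slightly bad ideas,
# but the fat-tailed one occasionally produces a huge winner.
thin = lambda k: rng.normal(-0.1, 0.05, size=k)
fat = lambda k: 0.05 * rng.standard_t(1.5, size=k) - 0.1

gains = {
    (name, k): expected_gain(dist, n_total=10_000, n_experiments=k)
    for name, dist in [("thin", thin), ("fat", fat)]
    for k in (10, 1000)
}
for key, g in sorted(gains.items()):
    print(key, round(g, 3))
```

With these assumed parameters, the thin-tailed case does better with a few large experiments (small noisy ones mostly ship false positives), while the fat-tailed case does better with many small ones: a tail innovation is so large that even a noisy estimate flags it, so the option value of screening more ideas dominates.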
The Connection: Fat Tails in Poverty Alleviation Policies
This paper came to mind in a completely different context, when listening to Rachel Glennerster talk about the cost-effectiveness of different educational policies.
“…when we looked at the cost effectiveness of education programs, there were a ton of zeros, and there were a ton of zeros on the things that we spend most of our money on. So more teachers, more books, more inputs, like smaller class sizes – at least in the developing world – seem to have no impact, and that’s where most government money gets spent.”
“But measurements for the top ones – the most cost effective programs – say they deliver 460 LAYS per £100 spent ($US130). LAYS are Learning-Adjusted Years of Schooling. Each one is the equivalent of the best possible year of education you can have – Singapore-level.”
This characterization argues for the presence of fat tails in education innovations: a large number of them are barely cost-effective, but the most cost-effective innovations are orders of magnitude above the mean. Furthermore, the two settings are strikingly similar, which is somewhat obscured by the fact that an experiment in one setting is called an A/B test and in the other setting is called an RCT. In both settings, a decisionmaker has the capacity to run experiments to determine the optimal policies to adopt. In both settings, the decisionmaker is constrained to trade off between the size of an experiment and how many experiments can be supported. Thus, we can neatly import the paper’s results into the new setting of randomized evaluations for education policies.
What Azevedo et al suggest, then, is that education policymakers should run RCTs to evaluate a large number of experimental educational policies, even if that means each RCT is small-scale: that grants the highest expected reward, because a larger number of evaluations has a higher chance of unearthing that fat tail and discovering truly transformative educational interventions. This is especially exciting when considering that Glennerster’s characterization might hold true for areas other than educational policy. If agricultural policy has fat-tailed innovations, then this offers a useful framework for agricultural departments to consider how to test agricultural policies. In any area where program evaluations are conducted through RCTs and innovation quality is plausibly fat-tailed, the paper’s logic applies. Overall, I think this paper has really exciting implications for how policymakers should pilot and evaluate poverty alleviation policies, which could potentially produce huge benefits to society.
Of course, theoretical models come with assumptions that must be respected, and one wrinkle in this argument is that Azevedo et al assume that experimentation is costless and that the decisionmaker’s constraint is a limited number of subjects. This is true in their setting, where implementing a policy is simply a software change. In contrast, a realistic model of educational policies might recognize that budget constrains the government more than the number of school students in a country does. It is not obvious whether this substantially changes the result: a cost constraint for each innovation can be mapped onto a limit on how big a program evaluation can be. This is not as clean as a constraint solely on the number of subjects, because it means that the cost of an experiment is no longer solely a function of how many subjects it has, which could potentially spoil the results. However, from my (non-expert) reading of the proofs, it seems like the result should hold for costly interventions with unlimited subject pools.
If my intuition is wrong and it is non-trivial to extend the paper’s results to this new assumption set, such an extension might even be a paper in the making—another reason why this whole exercise is valuable to think about.
Azevedo, Eduardo M., Alex Deng, Jose Montiel Olea, Justin M. Rao, and E. Glen Weyl. 2019. “A/B Testing with Fat Tails.” Available at SSRN 3171224.