Yesterday I attended a seminar by Aaditya Ramdas, a post-doc here at Berkeley, on reproducibility in science and tech. It was an enlightening talk that not only clearly explained two important topics in statistical hypothesis testing, A/B testing and the p-value, but also opened my mind to something I had often thought about but had neither the knowledge nor the ability to explain: sometimes science is right because scientists want it to be right. Motivated by that, I’m writing this blog post on a not-so-new-but-totally-noteworthy (and previously unknown to me) problem and how it affected one of the best-known pieces of research.
But let’s start from the beginning.
A/B testing is one of the techniques most used by researchers, tech companies, economists, designers and everyone else to test new features or improvements, understand whether they are better than the previous version, and finally ship the tested feature or publish astounding discoveries.
Taking this awesome example from Optimizely, we can summarize A/B testing in the following way. Imagine a website with a grey “buy now” button and a designer who wonders whether a red “buy now” button would bring in more profit. With A/B testing, you would have two different webpages, and each user who visits the site is randomly shown scenario A or scenario B. If, after some time, the users in scenario B have generated more profit, scenario B becomes the new default webpage (it’s not actually that simple, but this is the general idea).
This type of testing is based on the null hypothesis, which states at the beginning of the experiment that the webpage we are testing is no better than the current one. This makes sense if we consider that we don’t have any data for the test yet, so the page we have used up to now is the default winner (or, we could say, the state of the art that the new webpage has to beat).
In order to accept scenario B or, more formally, to reject the null hypothesis and state that scenario B is actually better than scenario A, we have to be sure that the results obtained in scenario B are statistically significant, and this is evaluated through the p-value. Statistically significant means that the result is unlikely to have occurred by random chance alone. A p-value of, for example, 0.05 means that, in the original scenario (scenario A), a result like the one we saw, or an even more extreme one, had a 5% probability of occurring.
I stole the image below from Wikipedia, and it becomes very helpful with a numerical example. Let’s reuse the same one as before: scenario A is a website with a grey “buy now” button and scenario B is the same website with a red “buy now” button. If the website (with scenario A) sells $1,000 of products a week on average and, in one week with scenario B, it sold $4,000 of products, you look at the distribution of weekly sales under scenario A and check the probability of making $4,000 or more in a week. If that probability is 5% or less, it means that, if the null hypothesis were true, a result like that would happen at most once every 20 weeks.
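To make the random assignment concrete, here is a minimal sketch of how such a test could be simulated. Everything in it is my own assumption for illustration: the function name `run_ab_test`, the buy rates, and the choice to count purchases as a stand-in for profit. It is not how Optimizely or any real tool does it.

```python
import random

def run_ab_test(n_visitors, buy_rate_a, buy_rate_b, seed=0):
    """Simulate an A/B test where each visitor is randomly shown A or B.

    buy_rate_a and buy_rate_b are made-up probabilities that a visitor
    buys something when shown the grey (A) or red (B) button.
    """
    rng = random.Random(seed)
    stats = {
        "A": {"visitors": 0, "purchases": 0},
        "B": {"visitors": 0, "purchases": 0},
    }
    for _ in range(n_visitors):
        # Random assignment: each visitor has a 50/50 chance of seeing A or B.
        variant = rng.choice(["A", "B"])
        rate = buy_rate_a if variant == "A" else buy_rate_b
        stats[variant]["visitors"] += 1
        if rng.random() < rate:
            stats[variant]["purchases"] += 1
    return stats

stats = run_ab_test(10_000, buy_rate_a=0.05, buy_rate_b=0.06)
for variant, s in stats.items():
    print(variant, s["visitors"], s["purchases"], s["purchases"] / s["visitors"])
```

The point of the random split is that the two groups differ only in the button colour, so any difference in purchases is either caused by the button or by chance.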
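The numerical example can be sketched in a few lines of Python. The big assumption here is mine: I pretend that weekly sales under scenario A follow a normal distribution with mean $1,000 and a standard deviation of $1,500, a number I made up only so that there is something to compute. With real data you would use the observed distribution instead.

```python
from statistics import NormalDist

# Assumption (not from any real data): weekly sales under scenario A,
# i.e. under the null hypothesis, are normal with mean 1000 and std 1500.
null_sales = NormalDist(mu=1000, sigma=1500)

observed = 4000  # the week with the red button

# One-sided p-value: probability of a week at least this good
# if nothing had changed.
p_value = 1 - null_sales.cdf(observed)

print(f"p-value = {p_value:.4f}")
print("significant at 0.05:", p_value <= 0.05)
```

Under these made-up numbers the observed week is two standard deviations above the mean, which gives a p-value of about 0.023, below the 0.05 threshold.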
I always use 0.05 (5%) as an example because it’s the most commonly used threshold for deciding whether a result is statistically significant.
But why this long and technical introduction? Because today I learnt that this statistical significance measure is not enough!
It is absolutely important to understand how the hypothesis is tested, how the data is collected and many other factors. For this reason, Simmons, Nelson and Simonsohn wrote that it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis, and that false positives are easy to find even in well-published research.
They introduce the concept of p-hacking, that is, altering the data collection or analysis until the all-important value of 0.05 is reached. How? By collecting data continuously until that point is reached, by analysing the data repeatedly but reporting only once p < 0.05, by excluding participants or by transforming the data.
I already knew about A/B testing and the p-value concept, but I didn’t know about the existence of p-hacking or about the large number of discoveries that were affected by it.
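To convince myself that "peeking" at the data really matters, I wrote a small simulation of the first trick. It is only a sketch under my own assumptions: the data is pure noise (so the null hypothesis is true by construction), the test is a simple two-sided z-test with known standard deviation, and the names `experiment` and `false_positive_rates` are mine.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(sample):
    """Two-sided z-test of 'mean == 0' for data with known std 1."""
    n = len(sample)
    z = sum(sample) / n * n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def experiment(rng, peek_every, max_n, alpha=0.05):
    """Return (significant at any peek, significant at the final sample size)."""
    sample = []
    peeked_significant = False
    while len(sample) < max_n:
        # Collect another batch of pure noise: there is no real effect.
        sample.extend(rng.gauss(0, 1) for _ in range(peek_every))
        # A p-hacker would stop and report as soon as this is True.
        if p_value(sample) < alpha:
            peeked_significant = True
    return peeked_significant, p_value(sample) < alpha

def false_positive_rates(n_experiments=2000, peek_every=10, max_n=100, seed=1):
    rng = random.Random(seed)
    results = [experiment(rng, peek_every, max_n) for _ in range(n_experiments)]
    peeking = sum(r[0] for r in results) / n_experiments
    fixed = sum(r[1] for r in results) / n_experiments
    return peeking, fixed

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives when testing once at the end: {fixed_rate:.3f}")
print(f"false positives when peeking every 10 subjects: {peeking_rate:.3f}")
```

Testing once at the planned sample size should give a false-positive rate close to the nominal 5%, while stopping at the first significant peek pushes it well above that, even though there is no effect at all in the data.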
You have almost certainly seen the “Power-posing, your body language shapes who you are” TED Talk [1]. It’s the fourth most viewed TED Talk at the time of writing and it has been posted millions of times in every possible feed. I first saw it almost one year ago, during a Personal Development and Team Leadership class. The professor was so proud of this video and its main theme: standing in a position of confidence/power, even if you are not confident, will boost your confidence and make you feel more powerful.
Even though I changed neither my mindset nor my way of sitting or acting after watching the video, it was difficult not to think about it. I mean, come on, before an interview I can go to the bathroom, do the super power pose for two minutes and then nail the interview? Too easy, right? I heard housemates and friends speaking about it, I heard my mom (who doesn’t even know how to switch on a computer) speaking about it, and I read about many people confirming and supporting these poses.
Well, and now I’m getting to the point: it turns out that Dana Carney and Amy Cuddy engaged in p-hacking without realizing it. In short, they were sure that the null hypothesis was wrong and tried to make that work experimentally with the data they had. This was discovered when other researchers around the world tried to reproduce the results of the experiments and couldn’t.
Dana Carney, one of the main authors of the publication, acknowledged the errors and published a document online, writing:
We ran subjects in chunks and checked the effect along the way. It was something like 25 subjects run, then 10, then 7, then 5. Back then this did not seem like p-hacking. It seemed like saving money (assuming your effect size was big enough and p-value was the only issue). Some subjects were excluded on bases such as “didn’t follow directions.” The total number of exclusions was 5. The final sample size was N = 42.
She also wrote that she does not believe that “power poses” effects are real and that she discourages others from studying power poses. These are strong positions to take on this topic, and I wish I had informed myself the same day I saw the video.
I know I’m two years late, but the fact that I’m late suggests that lots of people could have seen the TED Talk with no idea about the discussions that arose afterwards [2].
This example is just one of the many possible false-positive studies that may get published every year, maybe even without the authors themselves being aware of it. Can you imagine how many publications, results and scientific discoveries may not be true? If one of the most famous TED Talks is in this circle, how many others are? Hundreds? Thousands?
A final jump brings me to another topic I have thought about a lot recently: reproducibility. Lately I’ve been trying to reproduce results from some scientific papers and, after months and months of struggling, I lost motivation and started thinking that the promised results are not true [3]. In general, though, should authors of scientific publications release datasets, source code, and instructions on how to reproduce their results? I think so, for a more democratic scientific community whose main aim is to reach the truth. Although reproducing results from scratch is one of the best ways to learn and gain knowledge in a specific field, I think the improvements in the speed and truthfulness of science would be enormous.
[UPDATE: Dec 11th, 2017]: I would like to suggest that readers who are more interested in the power poses experiments dive into the links in the footnotes. The (un)replicability of the results mainly concerns the hormone results and not the self-reported feeling of increased confidence, which does indeed seem to be replicable. This Google Doc documents results and comments on papers that try to reproduce the self-reported feeling.