Agreements with and Counterarguments to “Response of Critique of Dream Investigation Results”

Photoexcitation

January 8, 2021

Executive Summary: The Minecraft Speedrunning Team wrote a Report that concluded that a subset of six livestreams had low probability events so extreme as to establish that Dream’s games were inappropriately modified. Photoexcitation provided a Dream-commissioned Critique of this Report that disagreed on a few key points and concluded that the numbers were not as extreme as proposed, though still with very low odds. The Minecraft Speedrunning Team has written a Response or rebuttal to the Photoexcitation Critique claiming multiple issues with the calculations. This current report, partially commissioned by Dream, provides counterarguments to the Response that support many of the conclusions of the original Critique. It also identifies many areas of agreement, including that the low probabilities are very strong arguments for the hypothesis that Dream cheated. Incorporating the arguments from the Minecraft Speedrunning Team Response, a new estimate is provided of about 1 in 100 million odds that any livestreaming speedrunner would experience the low probability effects seen on Dream’s six livestreams.

1 Introduction

After my¹ “Critique of Dream Investigation Results”, associated with a Response video from Dream, the Minecraft Speedrunning Team have responded (the “Response” pdf). They viewed several aspects of my report unfavorably. I do not argue or believe that their original Report or the Response is unfair or biased, but hope that discussion and dialogue will help identify the most robust conclusions.

This response was partially commissioned by Dream, but mostly represents my desire to better explain my original Critique, to defend myself, and to help maintain the credibility of Photoexcitation.

2 Areas of Agreement

As I wrote in my original report, “Probability calculations are hard. There may not be one ‘right’ way to do something. It is easy to violate some hidden or unknown assumption. There is room for healthy debate about different methods and results.” There will likely be areas of this debate where we don’t fully reach mutually agreeable conclusions. However, there are many areas where we can agree. As everyone attempts to draw some reasonable conclusions and to speed the discussion along, I’ve decided to start with areas of agreement.

I won’t speak for the Minecraft Speedrunning Team, but here are some conclusions that I believe that they and I would agree on.

1 The discussion boards have been abuzz with questions about my identity, which is protected by Photoexcitation’s anonymity policy. I am the same author as the original Critique. Dream knew more about my identity than I revealed in the original report or than can be found on Photoexcitation’s website (https://www.photoexcitation.com). I do hold a degree from Harvard and use advanced statistics as an astrophysicist. I wish to maintain my anonymity, though I understand that for some this introduces the possibility that my credentials cannot be vetted.

1. Even in the best-possible interpretation, the probability that any speedrunner had any streak of luck as strong as Dream’s is extremely low, providing significant evidence that the ender pearl and/or blaze rod drop probabilities were modified, resulting in a very strong argument for the hypothesis that Dream cheated.

2. Anyone who claimed that I said “Dream did not cheat” was severely misrepresenting what I wrote. My report was focused on providing an additional probability calculation and the difference in our final numbers was large, but the difference between luck of 99.9999% vs. 99.9999999999% is only meaningful for people who are already quite inclined to believe Dream’s claims that he did not cheat. My Critique also raised the possibility that the probabilities were modified, but without malicious intent.

3. Conclusions about whether Dream cheated are unlikely to be substantively different even if there are discrepancies of a factor of 10 or 100 in the probability calculations. Anything smaller than a factor of about 10 may not be worth detailed investigation given the range of numbers under consideration.

4. Mixing in other runs from Dream significantly increases the probability of an unmodified game, but is not directly relevant to the main hypothesis that Dream cheated during this particular sequence of runs. Deciding which subset of runs to investigate after noticing that a particular subset was lucky requires some kind of correction. The original Report clearly provides such a correction [it was never my intention to imply otherwise] by imagining that all possible substreaks were investigated, which increases the probability by a factor of 10-100.

5. Including other streams isn’t entirely meaningless, as it does establish that Dream’s probabilities were not modified for all of his livestreams. This conclusion is separate from any conclusion about the six streams in question.

6. When accounting for other possible random numbers that might have been investigated, going from 10 to 37 only provides an increase of about³ a factor of 15. Not all of these are obvious methods for cheating or methods that could have been as easily investigated, so using a correction of 37 × 36 = 1332 is an overestimate. My Critique used 1000, while the original MST Report used 90.

7. One major difference in probability is due to the comparison of the number of speedrunners/streams/runs.

8. The Sampling Bias Corrections that I originally argued aren’t appropriate to lucky streaks were actually fine. The arguments in this section of my Critique were weak, distracting, and had the effect of overcriticizing the original Report. Even so, since any disagreement on these corrections was not used in my probability calculations, the conclusions of my Critique are unaffected.

9. The Bonferroni correction is a reasonable way to correct for p-hacking and other after-the-fact corrections.

10. The Bonferroni and (Dunn-)Šidák corrections are equivalent when the p-values are very small.

11. We currently disagree on whether the “Barter Stopping” correction should be used and whether it was already accounted for in their original Stopping model.

12. The sum of negative binomial distributions is distributed as a negative binomial distribution. The Barter Stopping model I proposed in my Critique is effectively a combination of (sums of) negative binomial distributions.

13. I mixed Bayesian and frequentist concepts in my Critique.
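The claim that the Bonferroni and Šidák corrections agree for very small p-values can be checked numerically. The following is a minimal sketch, not a calculation from either report: the raw probability and the number of comparisons are hypothetical values chosen only for illustration, and the function names are my own.

```python
import math


def bonferroni(p: float, m: int) -> float:
    """Bonferroni-corrected probability: m * p, capped at 1."""
    return min(1.0, m * p)


def sidak(p: float, m: int) -> float:
    """Sidak-corrected probability: 1 - (1 - p)**m.

    Uses log1p/expm1 so that tiny p values are not lost to rounding.
    """
    return -math.expm1(m * math.log1p(-p))


# Hypothetical inputs, for illustration only.
p_raw = 1e-12   # assumed uncorrected probability
m = 1000        # assumed number of comparable investigations

b = bonferroni(p_raw, m)
s = sidak(p_raw, m)
relative_gap = abs(b - s) / b
print(f"Bonferroni: {b:.6e}  Sidak: {s:.6e}  relative gap: {relative_gap:.2e}")
```

For probabilities that are not small the two corrections do diverge (for example, p = 0.5 with m = 2 gives 1.0 versus 0.75), which is why the agreement is stated only for the small-p regime.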

I again reiterate that I have not discussed these points with the Minecraft Speedrunning Team, but that I believe that they would agree with me on the above points, based on their reports. I provide these “agreed-upon” conclusions in the interest of expediency and with the goal of more quickly concluding our back-and-forth discussion.

3 Instead of 10 × 9, the factor would be 37 × 36, leading to an increase of $\frac{37 \times 36}{10 \times 9} = 14.8$.

I think it is cool that this discussion rose to the level of being posted on Andrew Gelman’s famous blog. :) One positive outcome was influencing large numbers of people to think more about math and statistics. Hopefully, highlighting areas of agreement will help people realize that there is much objectivity in these calculations.

I now turn to four major areas of discussion: barter stopping, mixing Bayesian and frequentist methods, lucky streak probabilities, and corrections for number of streamers.

3 Is Barter Stopping accounted for in the original MST Report?

My Critique raised the point that Barter Stopping is arguably a higher fidelity representation⁴ of what really happened in most of the speed runs than the Binomial Model used in the original MST Report. My Barter Stopping simulation is effectively a combination (depending on whether it takes 2 or 3 barters to reach 10 ender pearls) of negative binomial distributions. The Response claims that the original Report’s binomial model with arbitrary stopping condition (from their Appendix B) already accounts for this. However, the Response focuses on showing that the sum of negative binomials is distributed as a negative binomial (which I agree with). This shows that the per-barter stopping criterion basically washes out to just being a single stopping criterion at the end, and I’ve done simulations which agree with this result.
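The statement that a sum of negative binomials is again negative binomial can be verified exactly rather than by simulation. This is a minimal sketch under stated assumptions: it uses the “number of trials until the r-th success” parameterization, takes the per-barter ender pearl probability as 20/423 (the Minecraft 1.16.1 value, which is not quoted in this document), and uses two hypothetical runs that stop after 2 and 3 successful barters.

```python
from math import comb


def nb_pmf(n: int, r: int, p: float) -> float:
    """P(the r-th success occurs exactly on trial n), for n >= r."""
    if n < r:
        return 0.0
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


def convolved_pmf(n: int, r1: int, r2: int, p: float) -> float:
    """P(N1 + N2 = n) for independent N1 ~ NB(r1, p) and N2 ~ NB(r2, p)."""
    return sum(
        nb_pmf(k, r1, p) * nb_pmf(n - k, r2, p)
        for k in range(r1, n - r2 + 1)
    )


p = 20 / 423      # assumed per-barter ender pearl probability
r1, r2 = 2, 3     # hypothetical runs stopping after 2 and 3 successful barters

# Compare the convolution with a single NB(r1 + r2, p) over a range of totals.
max_diff = max(
    abs(convolved_pmf(n, r1, r2, p) - nb_pmf(n, r1 + r2, p))
    for n in range(r1 + r2, 200)
)
print(f"largest pointwise difference: {max_diff:.3e}")
```

The difference is at the level of floating-point rounding, which is consistent with the per-run stopping rules collapsing into a single stopping rule on the total number of successes.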

I argue that there is an additional key point here: the original report uses the binomial distribution where arguably a negative binomial distribution should be used. Despite the similar names and related formulas, these are not identical distributions. Their Response asserts but does not prove that their original Stopping Criterion is enough to overcome the issue of using a different distribution. Further, the Response does not show that my model is an inappropriate choice.
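For readers who want the two distributions side by side, the standard textbook forms are written out below. The symbols are my own notation for this illustration: $p$ is the per-barter success probability, $K$ is the number of successes in a fixed number $n$ of barters, and $N$ is the number of barters needed to reach a fixed number $k$ of successes.

```latex
% Binomial: number of barters n is fixed, number of successes K is random.
P(K = k \mid n) = \binom{n}{k}\, p^{k} (1-p)^{n-k}

% Negative binomial: number of successes k is fixed, number of barters N is random
% (the final barter must be a success).
P(N = n \mid k) = \binom{n-1}{k-1}\, p^{k} (1-p)^{n-k}
                = \frac{k}{n}\, P(K = k \mid n)
```

The two probability mass functions share the factor $p^{k}(1-p)^{n-k}$ but differ in the counting term, so they describe different random quantities even though the formulas are closely related.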

In my Bayesian analysis, I perform both the barter-stopping criterion (equivalent to the negative binomial distribution) and the binomial distribution for ender pearls⁵ and find that the former leads to a posterior probability higher by a factor of 60 even without applying a stopping criterion on the last run (which increases the probability further). In the original Report, using their Stopping Rule increases the probability by only a factor of 2. While these are clearly apples and oranges, I do not feel that the Response has demonstrated that their original Stopping Criterion with a Binomial Model is equivalent to, comparable to, or more favorable than a Negative Binomial Model. My results are that the Barter Stopping Model (effectively a combination of Negative Binomial Models) is substantively different from the Binomial Model with the implemented Stopping Rule.

Another quick test is to calculate the cumulative distribution function evaluated for the negative binomial distribution for ender pearls. For 42 ender pearls and <262 barters, the negative binomial CDF is $6.7 \times 10^{-10}$, an order of magnitude higher than their “best-case” stopping criterion and similar to my $3 \times 10^{-10}$. For 262 barters and 42 or more successes, the cumulative probability is 9.7 x 10~', again larger than their “best-case” values. In my opinion, this provides more evidence that the Response has not demonstrated that their Stopping Criterion appropriately accounts for barter stopping, which is arguably more appropriate for the probabilities at hand.

All that said, different modeling methods only move the needle by a factor of 10-100 and don’t change the overall conclusion that the odds that any speedrunner in any set of streams experienced such an unusual outcome are extremely low.

4 Is mixing Bayesian and frequentist methods allowed?

I admitted above to mixing Bayesian and frequentist methods, but I disagree that this makes the answers “uninterpretable” or that Bayesian analysis can not, does not, or need not account for bias corrections.

4 Actually, I look at each run separately and assign it to Barter Stopping or Binomial, which is even higher fidelity than assuming one model for all the runs.

5 One criticism of my model is that the way I generate values for ender pearls cannot produce 8 ender pearls, even though this is seen in actual play. As I stated, “variations in this model were not significant,” which I validated by testing various distributions, including one that gave 4-8 ender pearls. Note that the choice of never giving 8 ender pearls causes the probabilities to be lower (less favorable to Dream), so I see little reason to criticize this choice.

While I agree that it would be better to perform the entire calculation within one probabilistic paradigm, that does not mean that the results of my analysis are invalid or uninterpretable. While I have not performed an end-to-end full Bayesian analysis, I have good reasons to suspect that these analyses are in the regime where the Bonferroni/Šidák corrections that we both use are appropriate in either the Bayesian or the frequentist paradigm. Bias corrections must be included and I propose that they can be applied to a posterior probability from a Bayesian analysis or a p-value from a frequentist analysis. Or, if you like, you can imagine my Bayesian analysis as a different way of computing a probability that is then interpreted in the frequentist paradigm.

Bayesian methods are susceptible to p-hacking, and including information on how and why the data were gathered is appropriate, e.g., That article quotes from the authoritative Bayesian Data Analysis textbook by Gelman et al. that it is erroneous to claim that “because all inference is conditional on the observed data, it makes no difference how those data are collected, ..., the essential flaw in the argument is that a complete definition of ‘the observed data’ should include information on how the observed values arose ....” I believe that the Response’s claims in Section 7.1 fall mostly into this category and argue that including the corrections in my Critique is allowed and interpretable.

5 Lucky Streak Probabilities

Keeping in mind that my Critique was written in a very short time, my initial investigations in early versions of the article were concerned with whether lucky streaks were appropriately accounted for. Most of the online arguments with my Critique are focused on this section, which I admit is weak and missing some details. As I continued to study, this point became less important and then disconnected from the rest of the analysis. But I thought I had found some results that were interesting to the discussion, which I left in. Honestly, I did not spend as much time triple-checking this part of the document because it was irrelevant to the final numbers. However, this section still criticizes the original MST Report, which was unnecessary and for which I apologize. In retrospect, I should have either toned down this section or removed it entirely.

One major focus has been on the coin flips question. As it seems like a simple test, I can see why disagreement on this question would be of concern. First, in my code, I accidentally calculated the probability of a streak of 19 heads in a row.⁶ Other possible issues are how to deal with streaks that are longer than the desired streak length. There are some good pedagogical statistics and probability discussions here, but in the interest of focusing on what is most important, I’m going to skip to the end and say I made a mistake. I will also apologize to the MST team for implying that the sampling bias correction methodology was inappropriate, especially when I end up using the Bonferroni-style corrections myself in the final analysis, and especially when the differences are comparable to the factor of 2 that I propose be ignored.
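For readers who want to see how an off-by-one in the streak length changes the answer, the probability of at least one run of heads can be computed exactly with a small dynamic program. This is an illustrative sketch and not the code used in my Critique: the function name is my own and the total of 100 flips is an assumed value for the example.

```python
def prob_streak(n_flips: int, streak: int, p_heads: float = 0.5) -> float:
    """Probability of at least one run of `streak` consecutive heads in n_flips.

    state[j] is the probability that the current trailing run of heads has
    length j and that no run of length `streak` has occurred so far.
    """
    state = [0.0] * streak
    state[0] = 1.0
    hit = 0.0  # probability mass that has already completed a full streak
    for _ in range(n_flips):
        new_state = [0.0] * streak
        for j, prob in enumerate(state):
            new_state[0] += prob * (1 - p_heads)      # tails resets the run
            if j + 1 == streak:
                hit += prob * p_heads                 # heads completes the streak
            else:
                new_state[j + 1] += prob * p_heads    # heads extends the run
        state = new_state
    return hit


n = 100  # assumed number of flips, for illustration only
p20 = prob_streak(n, 20)
p19 = prob_streak(n, 19)
print(f"P(run of 20+) = {p20:.3e}, P(run of 19+) = {p19:.3e}, ratio = {p19 / p20:.2f}")
```

Because streaks longer than the target are absorbed as soon as the target length is reached, this counts “at least” rather than “exactly” the desired streak, which is one of the conventions mentioned above. The ratio between the two results is about a factor of 2.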

Making this mistake has resulted in one strongly worded Reddit post declaring me an “amateur” and “unreliable.” While mistakes on basic calculations aren’t confidence-boosting, I would propose that identifying a single weak point in a paper that is unconnected to the rest of the analysis and then concluding that the entire paper is untrustworthy (without identifying specific issues on other aspects) is itself unprofessional. Even peer-reviewed journal articles are not held to this standard, especially when considering the nature of the mistake: a minor error on an unimportant point during an analysis completed in a very short amount of time.

Instead, I feel that the appropriate conclusions on this mistake are that:

1. That entire section of the Critique should be ignored. This gives strength to the original MST Report. My apologies again for including this section.

2. Including this section in the Critique gives concern that there are errors in the rest of the Critique. I don’t think that’s a particularly strong argument for saying “the original Report is completely correct and the Critique is completely wrong.” I think it is more appropriate to then scrutinize each argument, especially those that are most important. In this regard, the Response identifies what I would consider to be the key points and hence those are the ones that I focus on in this document.

Of course, the reader is free to draw their own conclusions.

6 For those unfamiliar with programming, I’ll mention that such mistakes are relatively common, e.g., https://en.wikipedia.org/wiki/Off-by-one_error. This was a mistake and not intentional.

6 Corrections for Number of Streamers/Streams/Runs

Let’s quickly review how the bias corrections for the number of streamers, streams, and runs were done. Since Dream’s runs were investigated specifically because they appeared lucky, it is important to provide a bias correction. Using a Bonferroni/Šidák correction, the final probability for Dream’s runs is multiplied by the number of comparable possible investigations that could have been done. In the original Report, this was given as 1000 streamers (see Equation 13) and 66 possible consecutive subsets of Dream’s 11 streams (see Equation 12) that could have been investigated (for ender pearls only, which I agree with). In my Critique, I proposed instead considering all possible (consecutive) streams, using an estimate of 300 livestreamed speedruns per day or, effectively, 100,000 total. I also mention a correction for choosing the length of the streams, to give another factor of 10.

One main difference is whether you count by first choosing a speedrunner and then studying a subset of their streams or whether you count the subsets of all streams. I’m not sure there’s an obvious reason to prefer one over the other when determining all the possible subsets that could have been investigated. In fact, they probably would give similar numbers if we assumed similar numbers of speedrunners... even as it is, the total correction is 66,000 vs 1,000,000, which is only a factor of 15. I agree that this is one of the important numbers for my probability being higher than theirs.
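The counting behind these two corrections can be reproduced in a few lines. This sketch uses only the numbers quoted above (11 streams, 1000 streamers, 100,000 streams, and a factor of 10 for stream length); the function and variable names are my own.

```python
def consecutive_subsets(n_streams: int) -> int:
    """Number of non-empty runs of consecutive streams among n_streams."""
    return n_streams * (n_streams + 1) // 2


# Original Report: choose a streamer, then a consecutive subset of their streams.
subsets = consecutive_subsets(11)           # 66
report_correction = 1000 * subsets          # 66,000

# Critique: count all streams, with a further factor for the subset length.
critique_correction = 100_000 * 10          # 1,000,000

ratio = critique_correction / report_correction
print(subsets, report_correction, critique_correction, round(ratio, 1))
```

The ratio of roughly 15 is the size of the disagreement attributable to how the comparison population is counted.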

The Response gives two very good points in estimating the number of livestreaming speedrunners. First, they point out that the number of livestreamers is not a “steady state” but is rapidly growing so that my assumption of typical leaderboard entry ages of one month probably overestimates the number of speedrunning livestreams. That’s a great point. Even more meaningful is the idea that they present that when considering the sampling bias, each speedrunner need not be considered as a “binary” in or out, but that the correction should be weighted by the probability that the livestreaming speedrunner would be investigated. That is an excellent point. Most speedrunners, even those that livestream, would not have been investigated at this level of scrutiny. I think this deserves decreasing the odds by a factor of 10-100. On the other hand, the Response seems to focus on 1.16 speedruns specifically, whereas the conclusions (in the original Report and in mine) are on all Minecraft speedrunning. So, leaving it at all Minecraft speedrunning and integrating over all past years, I’ll lower my number of meaningful streams to 10,000 and draw a new conclusion that the odds of any small subset of any livestreamed speedrunner ever receiving as low a probability as Dream is 1 in 100 million. Note that our reports now basically agree on how large of a correction to apply to go from Dream’s six runs to all possible subsets of all speedruns.
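The effect of lowering the stream count on the final odds can be written as a one-line rescaling. This is a sketch that assumes the correction is in the Bonferroni regime, where the corrected probability is proportional to the number of comparisons; the starting value of 1 in 10 million is the estimate from my original Critique, and the function name is my own.

```python
from fractions import Fraction


def rescale_probability(old_probability: Fraction, old_count: int, new_count: int) -> Fraction:
    """Rescale a Bonferroni-corrected probability when the comparison count changes.

    Valid only while the corrected probability stays far below 1, where the
    correction is linear in the number of comparisons.
    """
    return old_probability * Fraction(new_count, old_count)


old_estimate = Fraction(1, 10_000_000)   # original Critique: 1 in 10 million
new_estimate = rescale_probability(old_estimate, 100_000, 10_000)
print(f"1 in {new_estimate.denominator:,}")
```

Reducing the number of meaningful streams from 100,000 to 10,000 therefore moves the estimate by exactly one factor of 10.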

I’ll note here for the layperson that when we say “that any speedrunner was ever this lucky” it feels natural to conclude that Dream’s odds of being this lucky were much worse than this. That is not a correct conclusion because it “undoes” the very real sampling bias that Dream specifically was investigated because his runs seemed very “lucky.” You could also write this as 1 in 100 million odds of Dream ever receiving as low a probability accounting for the fact that he was investigated because he seemed too lucky.

7 Conclusion

While some of the issues raised in the Minecraft Speedrunning Team Response to my original Critique were valid, I disagree with their assessment on others. After including their considerations, especially of the number of speedrunners to compare to, I re-evaluate the odds from 1 in 10 million to about 1 in 100 million and still think that an upper bound of 1 in 7.5 trillion is too strong. But either way, the probabilities were almost certainly modified and this provides very strong evidence that Dream cheated.