Actually, the confidence interval calculation is what I use at work to validate math models against test results for the products we build. At my job, if a test data point falls outside the 90 percent confidence interval, we say that it fails to validate the model.

Right, but are you modeling probabilistic events?

Confidence intervals are useful when we have a definite hypothesis and definite test results. They allow us to represent things like our level of confidence that our measurements are accurate, and how confident we are that our results point to an actual phenomenon and not just statistical noise.

Confidence intervals are also useful when we have definite results from a subset of some larger population and we want to extrapolate from them. We know with complete certainty how people in exit polls voted. Confidence intervals allow us to represent our confidence that these exit polling numbers are an accurate reflection of all the ballots cast.

In your case, it sounds like confidence intervals allow you to represent the confidence with which you can say that a rocket landed where your model said it would because the model is right, and not because of measurement error or expected variability or whatever.

In this case, the programmers are telling us 20 percent is the outcome we should see.

No, they aren't. I think that is precisely the problem here.

There is a very important difference between saying that something has a 20% chance to happen and saying that something will happen 20% of the time.

For the sake of illustration, let's say that we are going to reverse engineer five items, each with a 20% chance to teach us a new schematic. A simple probability table gives us the following percentage chances for each of the six possible outcomes:

0/5 Successfully teach us a new schematic = 32.77%

1/5 Successfully teach us a new schematic = 40.96%

2/5 Successfully teach us a new schematic = 20.48%

3/5 Successfully teach us a new schematic = 5.12%

4/5 Successfully teach us a new schematic = 0.64%

5/5 Successfully teach us a new schematic = 0.032%
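The table above can be reproduced with the binomial formula. This is a minimal sketch, assuming each reverse engineering attempt is independent and has the same 20% chance; the function and variable names are made up for illustration and do not come from the game.

```python
from math import comb


def binomial_pmf(successes: int, attempts: int, chance: float) -> float:
    """Probability of exactly `successes` hits in `attempts` independent tries."""
    misses = attempts - successes
    return comb(attempts, successes) * chance**successes * (1 - chance) ** misses


ATTEMPTS = 5   # items reverse engineered in one batch
CHANCE = 0.20  # assumed per-item chance to learn a schematic

# Probability of each possible outcome, from 0/5 up to 5/5.
table = {k: binomial_pmf(k, ATTEMPTS, CHANCE) for k in range(ATTEMPTS + 1)}

for k, p in table.items():
    print(f"{k}/{ATTEMPTS} teach us a new schematic = {p:.3%}")
```

The six probabilities sum to 1, which is a quick sanity check that no outcome has been left out.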

We don't have a definite hypothesis because our model is probabilistic. Our chance of learning exactly one new schematic from five reverse engineering attempts (a perfect 20% success rate) is less than half. It is the most probable of the six outcomes - it should occur with greater frequency than any other individual outcome - but it is substantially less probable than all the other outcomes put together.

If your friend said to you "I am going to reverse engineer five items. I bet you I learn exactly one new schematic, no more and no less," you would be smart to bet against him. Your odds of winning are roughly 3:2. If you and your friend made the same bet over and over and over, you ought to win about one and a half times as often as you lose.
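The arithmetic behind that bet is short enough to check by hand. A rough sketch, again assuming five independent attempts at 20% each (the variable names are only for illustration):

```python
# Chance that exactly one of five independent 20% attempts succeeds:
# 5 ways to pick the winner * 0.20 for the success * 0.80 for each of 4 misses.
p_exactly_one = 5 * 0.20 * 0.80**4

# Chance of any other result (0, 2, 3, 4, or 5 successes).
p_anything_else = 1 - p_exactly_one

# Odds in favor of the person betting against "exactly one".
odds_against = p_anything_else / p_exactly_one

print(f"exactly one: {p_exactly_one:.2%}")
print(f"anything else: {p_anything_else:.2%}")
print(f"odds for betting against: {odds_against:.2f} to 1")
```

The ratio comes out just under 1.5 to 1, which is where the "roughly 3:2" figure comes from.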

It is not impossible to make an argument about probability by way of confidence intervals, but I think it is kind of a clunky way to do it.

Common sense tells us that if we perform four trials, and in each of the four trials we learn five new schematics from five reverse engineering attempts (a result that we would expect to see in only about 1 of every 3,000 trials), we can have a high degree of confidence in our inference that something is probably biasing the results.

Things get a lot trickier when the results are less extreme, though. How many times in a row do you have to flip heads before you conclude that your coin is not fair?

Is a 13.4% success rate over 400+ trials evidence enough to conclude that the system is not working as it is supposed to? The answer to that question depends entirely on how much variance normally exists among measurements like this. The 99.7% rule says that 99.7% of all values in a normal distribution fall within three standard deviations of the mean. In other words, any data point which is more than three standard deviations from the mean is an extreme outlier and extremely unlikely to occur by mere chance. But to say that your results are three standard deviations from the mean, you need to know what the standard deviation for tests like this is. I have no idea what that would be, but my intuition suggests that 13.4% is probably within three of them.
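One way to test that intuition: if every attempt really is an independent 20% roll, the standard deviation of the observed success rate does not have to be measured, because it follows from the model itself as sqrt(p * (1 - p) / n). The sketch below is only an illustration under those assumptions, and it uses exactly 400 attempts because the post only says "400+"; the function name is hypothetical.

```python
from math import sqrt


def z_score_for_rate(observed_rate: float, expected_rate: float, trials: int) -> float:
    """Distance of an observed success rate from the expected rate,
    in standard deviations, assuming independent fixed-chance trials."""
    std_dev = sqrt(expected_rate * (1 - expected_rate) / trials)
    return (observed_rate - expected_rate) / std_dev


# Assumed inputs: 13.4% observed, 20% expected, exactly 400 attempts.
z = z_score_for_rate(0.134, 0.20, 400)
print(f"{z:.2f} standard deviations from the expected 20%")
```

Under these assumptions the standard deviation is 2 percentage points, so 13.4% lands a little more than three standard deviations below 20%, and a larger trial count at the same observed rate would push it further out. That suggests the guess above may be too generous, though the result still depends on the attempts being independent and on the true trial count.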

This is admittedly outside my area of expertise, but if you do not know how much statistical variance normally exists across trials of that size, I do not think it is even possible to make a meaningful claim about the significance of your results using a confidence interval calculation.