After taking a quick refresher in statistics (using Wikipedia) and running the OP's numbers, I want to apologize to the OP. His analysis is spot on: his results fall outside the expected margin of error. I focused on the size of his sample rather than on his analysis of that sample, and for that I apologize.
But as you said, his is only one test, and to conclude that the system is broken based on that one test is irresponsible. More testing is required to reach a sound conclusion. So, at the end of the day, we are to an extent right back where we started: the "sample size" argument still applies.