Archive for the ‘Statistics’ Category

False positives for medical papers

Friday, February 8th, 2008

My previous two posts have been about false research conclusions and false positives in medical tests. The two are closely related.

With medical testing, the prevalence of the disease in the population at large matters greatly when deciding how much credibility to give a positive test result. Clinical studies are similar. The proportion of potential genuine improvements in the class of treatments being tested is an important factor in deciding how credible a conclusion is.

In medical tests and clinical studies, we’re often given the opposite of what we want to know. We’re given the probability of the evidence given the conclusion, but we want to know the probability of the conclusion given the evidence. These two probabilities may be similar, or they may be very different.
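To make the inversion concrete, here’s a small sketch in Python with made-up numbers: an assumed proportion of genuinely effective treatments, an assumed power for each study, and the usual 0.05 significance cutoff. The point is only to show how strongly the assumed prior proportion drives the credibility of a positive result.

    # Hypothetical sketch: probability a "positive" study conclusion is true,
    # given an assumed prior proportion of genuinely effective treatments.
    # The numbers below are illustrative, not from any real set of trials.
    def prob_conclusion_true(prior, power=0.8, alpha=0.05):
        true_positives = prior * power          # genuine effects that get detected
        false_positives = (1 - prior) * alpha   # null effects that look "significant"
        return true_positives / (true_positives + false_positives)

    print(prob_conclusion_true(0.10))   # ~0.64 if 1 in 10 treatments really works
    print(prob_conclusion_true(0.01))   # ~0.14 if only 1 in 100 really works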

The analogy between false positives in medical testing and false positives in clinical studies is helpful, because the former is easier to understand than the latter. But the problem of false conclusions in clinical studies is more complicated. For one thing, there is no publication bias in medical tests: patients get the results, whether positive or negative. In research, negative results are usually not published.

False positives for medical tests

Friday, February 8th, 2008

The most commonly given example of Bayes’ theorem is testing for rare diseases. The results are not intuitive. If a disease is rare, your probability of having the disease can remain low even after you test positive. For example, suppose a disease affects 0.1% of the population and a test for the disease is 95% accurate, both for people who have the disease and for people who don’t. Then your probability of having the disease given that you test positive is only about 2%.
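Here’s the arithmetic behind that 2% figure, assuming “95% accurate” means the test is right 95% of the time both for people who have the disease and for people who don’t.

    # Bayes' theorem for the rare-disease example above.
    prevalence = 0.001     # disease affects 0.1% of the population
    sensitivity = 0.95     # P(test positive | disease)
    specificity = 0.95     # P(test negative | no disease)

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_disease_given_positive = sensitivity * prevalence / p_positive

    print(p_disease_given_positive)   # about 0.019, i.e. roughly 2%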

Textbooks typically rush through the medical testing example, though I believe it takes more detail and numerical examples for it to sink in. I know I didn’t really get it the first couple times I saw it presented.

I just posted an article that goes over the medical testing example slowly and in detail: Canonical example of Bayes’ theorem in detail. I take what may be rushed through in half a page of a textbook and expand it to six pages, and I use more numbers and graphs than equations. It’s worth going over this example slowly because once you understand it, you’re well on your way to understanding Bayes’ theorem.

Most published research results are false

Thursday, February 7th, 2008

John Ioannidis wrote an article in Chance magazine a couple years ago with the provocative title Why Most Published Research Findings Are False. Are published results really that bad? If so, what’s going wrong?

Whether “most” published results are false depends on context, but a large percentage of published results are indeed false. Ioannidis published a report in JAMA looking at some of the most highly-cited studies from the most prestigious journals. Of the studies he considered, 32% were found to have either incorrect or exaggerated results. Of those studies with a 0.05 p-value, 74% were incorrect.

The underlying causes of the high false-positive rate are subtle, but one problem is the pervasive use of p-values as measures of evidence.

Folklore has it that a “p-value” is the probability that a study’s conclusion is wrong, and so a 0.05 p-value would mean the researcher should be 95 percent sure that the results are correct. In this case, folklore is absolutely wrong. And yet most journals accept a p-value of 0.05 or smaller as sufficient evidence.

Here’s an example that shows how p-values can be misleading. Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result, and only those studies will publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100%!
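Here’s a quick simulation of that scenario. It relies only on the fact that a p-value from a trial of an ineffective drug is uniformly distributed between 0 and 1.

    import random

    # Simulate 1,000 trials of totally ineffective drugs. Under the null
    # hypothesis the p-value is uniform on (0, 1), so each trial still has
    # a 5% chance of looking "significant."
    random.seed(0)
    trials = 1000
    p_values = [random.random() for _ in range(trials)]
    published = [p for p in p_values if p <= 0.05]

    print(len(published))            # around 50 "significant" results get published
    print(len(published) / trials)   # error rate in the lab: about 5%
    # Every published result came from an ineffective drug, so the error
    # rate in the published literature is 100%.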

The example above is exaggerated, but look at the JAMA study results again. In a sample of real medical experiments, 32% of those with “significant” results were wrong. And among those that just barely showed significance, 74% were wrong.

See Jim Berger’s criticisms of p-values for more technical depth.

Population drift

Friday, February 1st, 2008

The goal of a clinical trial is to determine what treatment will be most effective in a given population. What if the population changes while you’re conducting your trial? Say you’re treating patients with Drug X and Drug Y, and initially more patients were responding to X, but later more responded to Y. Maybe you’re just seeing random fluctuation, but maybe things really are changing and the rug is being pulled out from under your feet.

Advances in disease detection could cause a trial to enroll more patients with early stage disease as the trial proceeds. Changes in the standard of care could also make a difference. Patients often enroll in a clinical trial because standard treatments have been ineffective. If the standard of care changes during a trial, the early patients might be resistant to one therapy while later patients are resistant to another therapy. Often population drift is slow compared to the duration of a trial and doesn’t affect your conclusions, but that is not always the case.

My interest in population drift comes from adaptive randomization. In an adaptive randomized trial, the probability of assigning patients to a treatment goes up as evidence accumulates in favor of that treatment. The goal of such a trial design is to assign more patients to the more effective treatments. But what if patient response changes over time? Could your efforts to assign the better treatments more often backfire? A trial could assign more patients to what was the better treatment rather than what is now the better treatment.
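Here’s a minimal sketch of the kind of design I have in mind, not the design analyzed in the report mentioned below. Patients are assigned by sampling Beta posteriors on each arm’s response rate, and the true response rates swap halfway through the trial to mimic an exaggerated population drift. The response rates and trial size are made up.

    import random

    # Minimal sketch of an adaptively randomized two-arm trial. Each patient
    # is assigned by drawing from Beta posteriors on the two response rates
    # and picking the arm that looks better in that draw. The true rates swap
    # halfway through the trial to mimic an exaggerated population drift.
    def adaptive_trial(n_patients=200, early_rates=(0.3, 0.5), late_rates=(0.5, 0.3)):
        successes = [0, 0]
        failures = [0, 0]
        responders = 0
        for i in range(n_patients):
            rates = early_rates if i < n_patients // 2 else late_rates
            draws = [random.betavariate(successes[a] + 1, failures[a] + 1)
                     for a in (0, 1)]
            arm = 0 if draws[0] > draws[1] else 1
            if random.random() < rates[arm]:
                successes[arm] += 1
                responders += 1
            else:
                failures[arm] += 1
        return responders

    random.seed(1)
    runs = 500
    print(sum(adaptive_trial() for _ in range(runs)) / runs)  # avg responders per trial

Running the same simulation with a fixed 50/50 assignment gives a baseline for judging whether the adaptation helps or hurts under drift.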

On average, adaptively randomized trials do treat more patients effectively than do equally randomized trials. The report Power and bias in adaptive randomized clinical trials shows this is the case in a wide variety of circumstances, but it assumes constant response rates, i.e. it does not address population drift.

I did some simulations to see whether adaptive randomization could do more harm than good. I looked at more extreme population drift than one is likely to see in practice in order to exaggerate any negative effect. I looked at gradual changes and sudden changes. In all my simulations, the adaptive randomization design treated more patients effectively on average than the comparable equal randomization design. I wrote up my results in The Effect of Population Drift on Adaptively Randomized Trials.

Programming the last mile

Tuesday, January 29th, 2008

In any programming project there comes a point where the programming ends and manual processes begin. That boundary is where problems occur, particularly for reproducibility.

Before you can build a software project, there are always things you need to know in addition to having all the source code. And usually at least one of those things isn’t documented. Statistical analyses are perhaps worse. Software projects typically yield their secrets after a moderate amount of trial and error; statistical analyses may remain inscrutable forever.

The solution to reproducibility problems is to automate more of the manual steps. It is becoming more common for programmers to realize the need for one-click builds. (See Pragmatic Project Automation for a good discussion of why and how to do this.  Here’s a one-page summary of the book.) Progress is slower on the statistical side, but a few people have discovered the need for reproducible analysis.

It’s all a question of how much of a problem should be solved with code. Programming has to stop at some point, but we often stop too soon. We stop when it’s easier to do the remaining steps by hand, but we’re often short-sighted in our idea of “easier”. We mean easier for me to do by hand this time. We don’t think about someone else needing to do the task, or the need for someone (maybe ourselves) to do the task repeatedly. And we don’t think of the possible debugging/reverse-engineering effort in the future.

I’ve tried to come up with a name for the discipline of including more work in the programming portion of problem solving. “Extreme programming” has already been used for something else. Maybe “turnkey programming” would do; it doesn’t have much of a ring to it, but it sorta captures the idea.

Example of the law of small numbers

Friday, January 25th, 2008

The law of small numbers says that people underestimate the variability in small samples. Said another way, people overestimate what can be accomplished with a small study. Here’s a simple example. Suppose a drug is effective in 80% of patients. If five patients are treated, how many will respond?

Many people reason that 80% means 4 out of 5, so if 5 people are treated, exactly 4 will respond. Always.

Others understand that things are not guaranteed to work out so neatly, but they still believe that it is highly likely that 4 people would respond. Maybe a 90% chance.

In fact, there’s only a 41% chance that exactly 4 would respond out of a sample of 5.
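The 41% figure is just a binomial probability:

    from math import comb

    # Probability that exactly 4 of 5 patients respond when each responds
    # independently with probability 0.8.
    print(comb(5, 4) * 0.8**4 * 0.2)   # 0.4096, i.e. about a 41% chance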

Laws of large numbers and small numbers

Thursday, January 24th, 2008

In case my previous note on the law of small numbers confused anyone, I’ll compare it to the law of large numbers.

The law of large numbers is a mathematical theorem; the law of small numbers is an observation about human psychology.

The name “law of large numbers” is a standard term applied to a theorem about the convergence of random variables. (OK, actually two theorems. Like nuclear forces, the law of large numbers comes in a strong and a weak form.)
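For reference, if X1, X2, … are independent, identically distributed random variables with common mean μ, and Sn = X1 + … + Xn, the two forms say roughly the following.

    \text{Weak law:}\quad \lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) = 0 \quad \text{for every } \varepsilon > 0

    \text{Strong law:}\quad P\left( \lim_{n \to \infty} \frac{S_n}{n} = \mu \right) = 1

The weak law asserts convergence in probability; the strong law asserts convergence with probability one, which is the stronger statement.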

The name “law of small numbers” is a pun, and I don’t believe the term is commonly used. Too bad. It’s a convenient label for a common phenomenon.

The law of small numbers

Thursday, January 24th, 2008

The book Judgment under uncertainty analyzes common fallacies in how people estimate probabilities. The book asserts that no one has good intuition about probability. Statisticians do better than the general public, not because their intuition is much better, but because they know not to trust their intuition; they know they need to rely on calculations.

One of the common fallacies listed in the book is the “law of small numbers.” In general, people grossly underestimate the variability in small samples. This phenomenon comes up all the time. It’s good to know someone has given it a name.

Selection bias and bombers

Monday, January 21st, 2008

During WWII, statistician Abraham Wald was asked to help the British decide where to add armor to their bombers. After analyzing the records, he recommended adding more armor to the places where there was no damage!

This seems backward at first, but Wald realized his data came from bombers that survived. That is, the British were only able to analyze the bombers that returned to England; those that were shot down over enemy territory were not part of their sample. These bombers’ wounds showed where they could afford to be hit. Said another way, the undamaged areas on the survivors showed where the lost planes must have been hit because the planes hit in those areas did not return from their missions.

Wald assumed that the bullets were fired randomly, that no one could accurately aim for a particular part of the bomber. Instead they aimed in the general direction of the plane and sometimes got lucky. So, for example, if Wald saw that more bombers in his sample had bullet holes in the middle of the wings, he did not conclude that Nazis liked to aim for the middle of wings. He assumed that there must have been about as many bombers with bullet holes in every other part of the plane but that those with holes elsewhere were not part of his sample because they had been shot down.

Thick tails

Friday, January 18th, 2008

Bart Kosko argues in his book Noise that thick-tailed probability distributions such as the Cauchy distribution are common in nature. This is the opposite of what I was taught in college. I remember being told that the Cauchy distribution, a distribution with no mean or variance, is a mathematical curiosity more useful for constructing academic counterexamples than for modeling the real world. Kosko disagrees. He writes

… all too many scientists simply do not know that there are infinitely many different types of bell curves. So they do not look for these bell curves and thus they do not statistically test for them. The deeper problem stems from the pedagogical fact that thick-tailed bell curves get little or no attention in the basic probability texts that we still use to train scientists and engineers. Statistics books for medicine and the social sciences tend to be even worse.

We see thin-tailed distributions everywhere because we don’t think to look for anything else. If we see samples drawn from a thick-tailed distribution, we may throw out the “outliers” before we analyze the data, and then a thin-tailed model fits just fine.

How do you decide what’s an outlier? Two options. You could use your intuition and discard samples that “obviously” don’t belong, or you could use a formal test. But your intuition may implicitly be informed by experience with thin-tailed distributions, and your formal test may also circularly depend on the assumption of a thin-tailed model.
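Here’s a small illustration of that circularity, using a standard Cauchy distribution and an arbitrary cutoff of four units from the median as the definition of an “obvious” outlier. Both choices are made up for the sake of the example.

    import math
    import random
    import statistics

    # Draw from a standard Cauchy distribution (thick tails, no mean or
    # variance), then discard "outliers" more than 4 units from the median.
    random.seed(2)
    sample = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(10_000)]

    med = statistics.median(sample)
    trimmed = [x for x in sample if abs(x - med) < 4]

    print(statistics.stdev(sample))        # enormous, driven by a few extreme values
    print(statistics.stdev(trimmed))       # modest, as if the data were thin-tailed
    print(1 - len(trimmed) / len(sample))  # roughly 15% of the data quietly discarded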

Literate programming and statistics

Tuesday, January 15th, 2008

Sweave, mentioned in my previous post, is a tool for literate programming. Donald Knuth invented literate programming and gives this description of the technique in his book by the same name:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

Knuth says the quality of his code went up dramatically when he started using literate programming. When he published the source code for TeX as a literate program and a book, he was so confident in the quality of the code that he offered cash rewards for bug reports, doubling the amount of the reward with each edition. In one edition, he goes so far as to say “I believe that the final bug in TeX was discovered and removed on November 27, 1985.” Even though TeX is a large program, this was not an idle boast. A few errors were discovered after 1985, but only after generations of Stanford students studied the source code carefully and multitudes of users around the world put TeX through its paces.

While literate programming is a fantastic idea, it has failed to gain a substantial following. And yet Sweave might catch on even though literate programming in general has not.

In most software development, documentation is an afterthought. When push comes to shove, developers are rewarded for putting buttons on a screen, not for writing documentation. Software documentation can be extremely valuable, but it’s most valuable to someone other than the author. And the benefit of the documentation may only be realized years after it was written.

But statisticians are rewarded for writing documents. In a statistical analysis, the document is the deliverable. The benefits of literate programming for a statistician are more personal and more immediate. Statistical analyses are often re-run, with just enough time between runs for the previous work to be completely flushed from short-term memory. Data is corrected or augmented, papers come back from review with requests for changes, etc. Statisticians have more self-interest in making their work reproducible than do programmers.

Patrick McPhee gives this analysis for why literate programming has not caught on.

Without wanting to be elitist, the thing that will prevent literate programming from becoming a mainstream method is that it requires thought and discipline. The mainstream is established by people who want fast results while using roughly the same methods that everyone else seems to be using, and literate programming is never going to have that kind of appeal. This doesn’t take away from its usefulness as an approach.

But statisticians are more free to make individual technology choices than programmers are. Programmers typically work in large teams and have to use the same tools as their colleagues. Statisticians often work alone. And since they deliver documents rather than code, statisticians are free to use Sweave without their colleagues’ knowledge or consent. I doubt whether a large portion of statisticians will ever be attracted to literate programming, but technological minorities can thrive more easily in statistics than in mainstream software development.

Irreproducible analysis

Tuesday, January 15th, 2008

Journals and granting agencies are prodding scientists to make their data public. Once the data is public, other scientists can verify the conclusions. Or at least that’s how it’s supposed to work. In practice, it can be extremely difficult or impossible to reproduce someone else’s results. I’m not talking here about reproducing experiments, but simply reproducing the statistical analysis of experiments.

It’s understandable that many experiments are not practical to reproduce: the replicator needs the same resources as the original experimenter, and so expensive experiments are seldom reproduced. But in principle the analysis of an experiment’s data should be repeatable by anyone with a computer. And yet this is very often not possible.

Published analyses of complex data sets, such as microarray experiments, are seldom exactly reproducible. Authors inevitably leave out some detail of how they got their numbers. In a complex analysis, it’s difficult to remember everything that was done. And even if authors were meticulous about documenting every step of the analysis, journals do not want to publish such great detail. Often an article provides enough clues that a persistent statistician can approximately reproduce the conclusions. But sometimes the analysis is opaque or just plain wrong.

I attended a talk yesterday where Keith Baggerly explained the extraordinary steps he and his colleagues went through in an attempt to reproduce the results in a medical article published last year by Potti et al. He called this process “forensic bioinformatics,” attempting to reconstruct the process that led to the published conclusions. He showed how he could reproduce parts of the results in the article in question by, among other things, reversing the labels on some of the groups. (For details, see “Microarrays: retracing steps” by Kevin Coombes, Jing Wang, and Keith Baggerly in Nature Medicine, November 2007, pp 1276-1277.)

While they were able to reverse-engineer many of the mistakes in the paper, some remain a mystery. In any case, they claim that the results of the paper are just wrong. They conclude “The idea … is exciting. Our analysis, however, suggests that it did not work here.”

The authors of the original article replied that there were a few errors but that these have been fixed and they didn’t affect the conclusions anyway. Baggerly and his colleagues disagree. So is this just a standoff with two sides pointing fingers at each other saying the other guys are wrong? No. There’s an important asymmetry between the two sides: the original analysis is opaque but the critical analysis is transparent. Baggerly and company have written code to carry out every tiny step of their analysis and made the Sweave code available for anyone to download. In other words, they didn’t just publish their paper, they published code to write their paper.

Sweave is a program that lets authors mix prose (LaTeX) with code (R) in a single file. Users do not directly paste numbers and graphs into a paper. Instead, they embed the code to produce the numbers and graphs, and Sweave replaces the code with the results of running the code. (Sweave embeds R inside LaTeX the way CGI embeds Perl inside HTML.) Sweave doesn’t guarantee reproducibility, but it is a first step.

Irrelevant uncertainty

Monday, January 14th, 2008

Suppose I asked where you want to eat lunch. Then I told you I was about to flip a coin and asked again where you want to eat lunch. Would your answer change? Probably not, but sometimes the introduction of irrelevant uncertainty does change our behavior.

In a dose-finding trial, it is often the case that a particular observation has no immediate importance to decision making. Suppose Mr. Smith’s outcome is unknown. We calculate what the next dose will be if he responds to treatment and what it will be if he does not respond. If both doses are the same, why wait to know his outcome before continuing? Some people accept this reasoning immediately, while others are quite resistant.

Not only may a patient’s outcome be irrelevant, the outcome of an entire clinical trial may be irrelevant. I heard of a conversation with a drug company where a consultant asked what the company would do if their trial were successful. He then asked what they would do if it were not successful. Both answers were the same. He then asked why do the trial at all, but his question fell on deaf ears.

While it is irrational to wait to resolve irrelevant uncertainty, it is a human tendency. For example, businesses may delay a decision on some action pending the outcome of a presidential election, even if they would take the same action regardless which candidate won. I see how silly this is when other people do it, but it’s not too hard for me to think of analogous situations where I act the same way.

Musicians, drunks, and Oliver Cromwell

Saturday, January 12th, 2008

Jim Berger gives the following example illustrating the difference between frequentist and Bayesian approaches to inference in his book The Likelihood Principle.

Experiment 1:

A fine musician, specializing in classical works, tells us that he is able to tell whether Haydn or Mozart composed a given classical piece. Small excerpts of the compositions of both authors are selected at random and the experiment consists of playing them for identification by the musician. The musician makes 10 correct guesses in exactly 10 trials.

Experiment 2:

A drunken man says he can correctly guess which face of a coin will come up when it is tossed. Again, after 10 trials the man correctly guesses the outcomes of all 10 throws.

A frequentist statistician would have as much confidence in the musician’s ability to identify composers as in the drunk’s ability to predict coin tosses. In both cases the data are 10 successes out of 10 trials. But a Bayesian statistician would combine the data with a prior distribution. Presumably most people would be inclined a priori to have more confidence in the musician’s claim than the drunk’s claim. After applying Bayes theorem to analyze the data, the credibility of both claims will have increased, though the musician will continue to have more credibility than the drunk. On the other hand, if you start out believing that it is completely impossible for drunks to predict coin flips, then your posterior probability for the drunk’s claim will continue to be zero, no matter how much evidence you collect.
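Here’s a sketch of that update, treating each claim as all-or-nothing: either the claimant really has the ability (and so gets all 10 right) or he is just guessing (probability (1/2)^10 of 10 correct guesses). The prior probabilities below are made up for illustration.

    # Posterior probability of a claim after 10 correct answers in 10 trials,
    # for a few illustrative prior probabilities.
    def posterior(prior, successes=10):
        likelihood_if_true = 1.0   # simplification: genuine ability never misses
        likelihood_if_guessing = 0.5 ** successes
        numerator = prior * likelihood_if_true
        return numerator / (numerator + (1 - prior) * likelihood_if_guessing)

    print(posterior(0.5))      # musician: from 0.5 to about 0.999
    print(posterior(0.0001))   # drunk: from 0.0001 to about 0.09
    print(posterior(0.0))      # closed mind: a zero prior stays zero forever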

Dennis Lindley coined the term “Cromwell’s rule” for the advice that nothing should have zero prior probability unless it is logically impossible. The name comes from a statement by Oliver Cromwell addressed to the Church of Scotland:

I beseech you, in the bowels of Christ, think it possible that you may be mistaken.

In probabilistic terms, “think it possible that you may be mistaken” corresponds to “don’t give anything zero prior probability.” If an event has zero prior probability, it will have zero posterior probability, no matter how much evidence is collected. If an event has tiny but non-zero prior probability, enough evidence can eventually increase the posterior probability to a large value.

The difference between a small positive prior probability and a zero prior probability is the difference between a skeptical mind and a closed mind.

Unbiased estimators can be terrible

Saturday, January 12th, 2008

An estimator in statistics is a way of guessing a parameter based on data. An estimator is unbiased if, averaged over many repetitions, its guesses come out to exactly the thing you’re estimating. Sounds eminently reasonable. But it might not be.

Suppose you’re estimating something like the number of car accidents per week in Texas and you counted 308 the first week. What would you estimate is the probability of seeing no accidents over the next two weeks?

If you use a Poisson model for the number of car accidents, a very common assumption for such data, there is a unique unbiased estimator of this probability. And that estimator says the probability of no accidents over the next two weeks is 1. Worse, had you counted 307 accidents, the estimated probability would be -1! The estimator alternates between two ridiculous values, but in the long run these values average out to the true value. Exact in the limit, useless on the way there. A slightly biased estimator would be much more practical.
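Here’s a numerical check of the counterexample, assuming the weekly count X follows a Poisson distribution with rate lambda. The unique unbiased estimator of exp(-2*lambda), the probability of no accidents over the next two weeks, turns out to be (-1)^X, which only ever takes the values 1 and -1.

    import math
    import random

    print((-1) ** 308, (-1) ** 307)   # the "estimates" after 308 and 307 accidents

    def poisson(lam):
        # Knuth's product-of-uniforms method for generating Poisson samples
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    # Check unbiasedness numerically for a small rate (a rate like 308 would
    # underflow exp(-lam) in this simple generator).
    random.seed(3)
    lam = 1.0
    estimates = [(-1) ** poisson(lam) for _ in range(1_000_000)]
    print(sum(estimates) / len(estimates))   # hovers near ...
    print(math.exp(-2 * lam))                # ... exp(-2) = 0.1353...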

See Michael Hardy’s article for more details: An_Illuminating_Counterexample.pdf