Archive for January, 2008

Selection bias and bombers

Monday, January 21st, 2008

During WWII, statistician Abraham Wald was asked to help the British decide where to add armor to their bombers. After analyzing the records, he recommended adding more armor to the places where there was no damage!

This seems backward at first, but Wald realized his data came from bombers that survived. That is, the British were only able to analyze the bombers that returned to England; those that were shot down over enemy territory were not part of their sample. These bombers’ wounds showed where they could afford to be hit. Said another way, the undamaged areas on the survivors showed where the lost planes must have been hit because the planes hit in those areas did not return from their missions.

Wald assumed that the bullets were fired randomly, that no one could accurately aim for a particular part of the bomber. Instead they aimed in the general direction of the plane and sometimes got lucky. So, for example, if Wald saw that more bombers in his sample had bullet holes in the middle of the wings, he did not conclude that Nazis liked to aim for the middle of wings. He assumed that there must have been about as many bombers with bullet holes in every other part of the plane but that those with holes elsewhere were not part of his sample because they had been shot down.
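Here is a minimal simulation of Wald's reasoning. The section names and the per-hit loss probabilities are made up for illustration and are not historical data. Hits land uniformly at random, as Wald assumed, but the holes counted on returning planes are far from uniform.

```python
import random
from collections import Counter

SECTIONS = ["engine", "cockpit", "fuselage", "wings"]

# Hypothetical chance that a single hit to each section downs the plane.
LOSS_PROB = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.05}


def simulate(num_bombers=10000, hits_per_bomber=3, seed=0):
    """Return (hits on all bombers, hits on surviving bombers) by section."""
    rng = random.Random(seed)
    all_hits = Counter()
    survivor_hits = Counter()
    for _ in range(num_bombers):
        # Bullets are not aimed: every section is equally likely to be hit.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_bomber)]
        all_hits.update(hits)
        # The plane returns only if it survives every one of its hits.
        if all(rng.random() >= LOSS_PROB[section] for section in hits):
            survivor_hits.update(hits)
    return all_hits, survivor_hits


all_hits, survivor_hits = simulate()
for section in SECTIONS:
    print(f"{section:9s} all planes: {all_hits[section]:5d}   "
          f"survivors: {survivor_hits[section]:5d}")
```

Under these assumptions, the sections with the fewest holes among the survivors are the ones where a hit is most often fatal, which is where the armor belongs.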

Repairing tumors

Saturday, January 19th, 2008

Imagine this conversation with your doctor:

Your poor tumor. It has a chaotic blood supply. Parts of it get too much blood, other parts too little. We’re going to give you a drug to improve your tumor’s blood supply, making it healthier.

Before you run screaming from your doctor’s office, see if there’s a copy of the January 2008 issue of Scientific American in the waiting room. If there is, read the article Taming Vessels to Treat Cancer by Rakesh Jain.

Just as the cells in a tumor are abnormal and growing out of control, so are the blood vessels that feed the tumor. This lack of proper infrastructure inhibits the tumor’s growth, but it also makes it difficult to deliver chemotherapy to the tumor. This led to the radical idea of making tumors healthier in preparation for killing them.

So how would you go about improving a tumor’s circulatory system? By administering a drug that was designed to attack tumor vessels!

A new class of cancer drugs, antiangiogenic agents, has been designed to attack tumors by cutting off their blood supply. These agents haven’t been a complete success. Experience with one such agent, Avastin, shows that while it shuts down some of the blood vessels in tumors, it may make the remaining tumor vessels healthier. That’s bad news if you’re treating patients with Avastin alone. But when used in combination with chemotherapy, it’s just what people like Dr. Jain were looking for: a way to normalize the blood flow in a tumor in order to make it more vulnerable to chemotherapy.

More information, including videos, is available at the web site of Dr. Jain’s lab.

Children don’t like clowns

Saturday, January 19th, 2008

Hospitals often paint clowns on the walls of the pediatric wing assuming children like them. When someone finally asked kids whether they like clowns, they found that not one out of 255 kids questioned did.

See No Clowning for Hospitalized Kids for details.

Interesting is better than perfect

Friday, January 18th, 2008

Seth Godin has an interesting blog post today called The problem with perfect. Companies with a reputation for perfect service are only remarkable when they disappoint. Being interesting is a more viable business strategy than being perfect.

Thick tails

Friday, January 18th, 2008

Bart Kosko argues in his book Noise that thick-tailed probability distributions such as the Cauchy distribution are common in nature. This is the opposite of what I was taught in college. I remember being told that the Cauchy distribution, a distribution with no mean or variance, is a mathematical curiosity more useful for constructing academic counterexamples than for modeling the real world. Kosko disagrees. He writes

… all too many scientists simply do not know that there are infinitely many different types of bell curves. So they do not look for these bell curves and thus they do not statistically test for them. The deeper problem stems from the pedagogical fact that thick-tailed bell curves get little or no attention in the basic probability texts that we still use to train scientists and engineers. Statistics books for medicine and the social sciences tend to be even worse.

We see thin-tailed distributions everywhere because we don’t think to look for anything else. If we see samples drawn from a thick-tailed distribution, we may throw out the “outliers” before we analyze the data, and then a thin-tailed model fits just fine.

How do you decide what’s an outlier? Two options. You could use your intuition and discard samples that “obviously” don’t belong, or you could use a formal test. But your intuition may implicitly be informed by experience with thin-tailed distributions, and your formal test may also circularly depend on the assumption of a thin-tailed model.
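To see how much the choice of model matters when judging outliers, here is a small sketch comparing tail probabilities of a standard normal and a standard Cauchy distribution. The function names are my own, and the cutoffs are arbitrary. Both formulas come from the closed-form distribution functions, so no sampling is involved.

```python
import math


def normal_tail(k):
    """P(|X| > k) for a standard normal random variable."""
    return math.erfc(k / math.sqrt(2))


def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy random variable."""
    return 1 - (2 / math.pi) * math.atan(k)


for k in (2, 5, 10):
    print(f"k = {k:2d}   normal: {normal_tail(k):.2e}   cauchy: {cauchy_tail(k):.2e}")
```

A sample ten units from the center is essentially impossible under the normal model but unremarkable under the Cauchy model. A test calibrated on the first will discard as outliers exactly the points that the second says to expect.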

Quick TeX to graphic utility

Thursday, January 17th, 2008

Here’s a web site where you can type in some TeX code, click a button, and get back a GIF with a transparent background. Handy for pasting equations into HTML.

For example:

[Image: Gaussian integral]

Coping with exponential growth

Thursday, January 17th, 2008

Everything is supposedly growing exponentially these days. But when most people say “exponential,” they don’t mean what they say. They mean “fast.” Exponential growth can indeed be fast. Or it can be slow. Excruciatingly slow.

If you earn a million dollars a day, your wealth is growing quickly, but not exponentially. And if you have $100 in the bank earning 3% compound interest, your money is growing slowly, but it is growing exponentially.

Linear growth is a constant amount of increase per unit of time. Exponential growth is a constant percentage increase per unit of time. If you buy a pack of baseball cards every Friday, the size of your baseball card collection will grow linearly. But if you breed rabbits with no restriction, the size of your bunny herd will grow exponentially.

It matters a great deal whether you’re growing linearly or exponentially.

When you start a new enterprise (a company, a web site, etc.), it may truly grow exponentially. Growth may be determined by word of mouth, which is exponential (at first). The number of new people who hear each month depends on the number of people who talk, and hearers become talkers. But that process can be infuriatingly slow when it’s just getting started. If the number of visitors to your web site is growing 5% per month, that’s great in the long term, but disappointing at first when it means going from 40 visitors one month to 42 the next.

How do you live on an exponential curve? You need extraordinary patience. While any exponential curve will eventually pass any linear curve, it may take a long time. If you’re making barely perceptible but compounding progress, be encouraged that you’re on the right curve. Eventually you’ll have all the growth you can handle. Realize that you may be having a harder time initially because you’re on the exponential curve rather than the linear curve.
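As a rough illustration of how long "eventually" can be, the sketch below counts the months until a site growing 5% per month from 40 visitors overtakes a site that adds a fixed number of visitors each month. The linear site and its 10 visitors per month are assumptions made up for the comparison.

```python
def months_until_exponential_wins(start, monthly_rate, linear_start, linear_step):
    """Count months until compounding growth exceeds fixed-step growth."""
    if start <= 0 or monthly_rate <= 0:
        raise ValueError("exponential growth needs a positive start and rate")
    month = 0
    exponential = start
    linear = linear_start
    while exponential <= linear:
        month += 1
        exponential *= 1 + monthly_rate  # constant percentage increase
        linear += linear_step            # constant amount of increase
    return month


# 40 visitors growing 5% per month versus 40 visitors plus 10 more each month.
print(months_until_exponential_wins(40, 0.05, 40, 10))
```

The exponential site spends years behind before it pulls ahead, which is the patience the paragraph above is asking for.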

How do you know whether you’re on an exponential curve? This is not as easy as it sounds. Because of random noise, it may be hard to tell from a small amount of data whether growth is linear or exponential, or even to tell growth from stagnation. Eventually the numbers will tell you. But until enough data come in to reveal what’s going on, look at the root causes of your growth. If you’re growing because customers are referring customers, that’s a recipe for exponential growth. If you’re growing because you’re working more hours, that’s linear growth.

Nothing grows exponentially forever. Word of mouth slows down when the message reaches saturation, when the talkers run into fewer people who haven’t heard. Rabbit farms slow down when they can’t feed all the rabbits. Most of the things we call exponential growth are more accurately logistic growth: exponential growth slows to linear growth, then linear growth begins to plateau.

How do you live on a logistic curve? Realize that initial exponential growth doesn’t last. Watch the numbers. They’ll tell you when you’ve gone from approximately exponential to approximately linear. Understand the mechanisms that turn exponential into logistic growth in your context.
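A minimal sketch of logistic growth, using a simple discrete update in which the monthly growth rate shrinks as the audience approaches a saturation level. The capacity of 10,000 and the 300-month horizon are arbitrary assumptions. Only the 40 visitors and 5% rate echo the example above.

```python
def logistic_growth(start, rate, capacity, months):
    """Discrete logistic model: growth slows as the value nears capacity."""
    values = [start]
    for _ in range(months):
        current = values[-1]
        values.append(current + rate * current * (1 - current / capacity))
    return values


values = logistic_growth(start=40, rate=0.05, capacity=10_000, months=300)
increments = [later - earlier for earlier, later in zip(values, values[1:])]
peak_month = increments.index(max(increments))

print(f"first-month growth factor: {values[1] / values[0]:.4f}")
print(f"monthly gain peaks in month {peak_month}")
print(f"final value: {values[-1]:.0f}")
```

Early on the curve is nearly indistinguishable from 5% compounding. The monthly gain peaks around the midpoint and shrinks afterward, and that peak is the signal to watch for in your own numbers.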

Stochastic independence

Wednesday, January 16th, 2008

Independence in probability can be both intuitive and mysterious. Intuitively, two events are independent if they have nothing to do with each other. Suppose I ask you to guess whether the next person walking down the street is left handed. I stop this person, and before I ask which hand he writes with, I ask him for the last digit of his phone number. He says it’s 3. Does knowing the last digit of his phone number make you change your estimate of the chances this stranger is a southpaw? No, phone numbers and handedness are independent. Presumably about 10% of right-handers and the same percentage of left-handers have a phone number ending in 3. Even if the phone company is more or less likely to assign numbers ending in 3, there’s no reason to believe they take customer handedness into account when handing out numbers. On the other hand, if I tell you the stranger is an artist, that should change your estimate: a disproportionate number of artists are lefties.

Formally, two events A and B are independent if P(A and B) = P(A) P(B). This implies that P(A | B), the probability of A happening given that B happened, is just P(A). Similarly P(B | A) = P(B).  Knowing whether or not one of the events happened tells you nothing about the likelihood of the other. Knowing someone’s phone number doesn’t help you guess which hand they write with, unless you use the phone number to call them and ask about their writing habits.
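The step from the product rule to the statement about conditional probability takes one line, assuming P(B) > 0 so that the conditional probability is defined:

```latex
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A), \qquad P(B) > 0.
```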

Now let’s extend the definition to more events. A set of events is mutually independent if, for every subset of two or more of the events, the probability that all the events in the subset occur is the product of their individual probabilities.

So, let’s look at three events: A, B, and C. If we know P(A and B and C) = P(A) P(B) P(C), are the three events mutually independent? Not necessarily. It is possible for the above equation to hold even though P(A and B) does not equal P(A) P(B). The definition of mutual independence requires something of every subset of {A, B, C} with two or more elements, not just the subset consisting of all elements. So we have to look at the subsets {A, B}, {B, C}, and {A, C} as well.
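One way to see this is a small constructed example. The sample space and events below are my own choice for illustration: eight equally likely outcomes, with three events of four outcomes each that share exactly one outcome.

```python
from fractions import Fraction

NUM_OUTCOMES = 8  # outcomes 1..8, all equally likely


def prob(event):
    """Probability of a set of outcomes under the uniform distribution."""
    return Fraction(len(event), NUM_OUTCOMES)


A = {1, 2, 3, 4}
B = {1, 2, 3, 5}
C = {1, 6, 7, 8}

# The triple product rule holds...
print(prob(A & B & C), "vs", prob(A) * prob(B) * prob(C))
# ...but A and B are not independent of each other.
print(prob(A & B), "vs", prob(A) * prob(B))
```

Exact fractions are used instead of floating point so the equalities and inequalities can be checked without rounding concerns.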

What if A and B are independent, B and C are independent, and A and C are independent? In other words, every pair of events is independent. Is that enough for mutual independence? Surprisingly, the answer is no. It is possible to construct a simple example where

  • P(A and B) = P(A) P(B)
  • P(B and C) = P(B) P(C)
  • P(A and C) = P(A) P(C)

and yet P(A and B and C) does not equal P(A) P(B) P(C).

There are no shortcuts to the definition of mutual independence.
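A standard example of pairwise independence without mutual independence uses two fair coin flips. The choice of events here is mine: A is "the first flip is heads," B is "the second flip is heads," and C is "the two flips agree."

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes: HH, HT, TH, TT.
OUTCOMES = set(product("HT", repeat=2))


def prob(event):
    """Probability of a set of outcomes under the uniform distribution."""
    return Fraction(len(event), len(OUTCOMES))


A = {flips for flips in OUTCOMES if flips[0] == "H"}       # first flip heads
B = {flips for flips in OUTCOMES if flips[1] == "H"}       # second flip heads
C = {flips for flips in OUTCOMES if flips[0] == flips[1]}  # flips agree

for name, first, second in (("A,B", A, B), ("B,C", B, C), ("A,C", A, C)):
    print(name, prob(first & second) == prob(first) * prob(second))

print("A,B,C", prob(A & B & C) == prob(A) * prob(B) * prob(C))
```

Any two of the events tell you nothing about each other, but any two together determine the third, so the three are not mutually independent.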

Sun acquires MySQL

Wednesday, January 16th, 2008

Jonathan Schwartz announced on his blog today that Sun is acquiring the company behind MySQL.

Obscuring complexity

Wednesday, January 16th, 2008

Here’s a great quote from The Logic of Failure on obscuring complexity.

By labeling a bundle of problems with a single conceptual label, we make dealing with that problem easier, provided we’re not interested in solving it. … A simple label can’t make the complex nature of the problem go away, but it can so obscure complexity that we lose sight of it. And that, of course, we find a great relief.

Revisiting six degrees and hubs

Tuesday, January 15th, 2008

In 1967, Stanley Milgram conducted the famous experiment that told us there are six degrees of separation between any two people. He gave letters to 160 people in Nebraska and asked them to pass them along to someone who could eventually get the letters to a particular stock broker in New York. It took about six links for each letter to get to the stock broker.

In 2001, Duncan Watts repeated the experiment, asking 61,000 people to forward emails to eventually reach 18 targets worldwide. The emails took roughly six hops to reach their targets, giving Milgram’s original conclusion more credibility due to the larger sample.

But a secondary conclusion of Milgram didn’t hold up. In Milgram’s experiment, half of the letters reached the target via the same three friends of the stock broker. These people were deemed “hubs.” In Watts’s experiment, only 5% of messages reached their targets via hubs. Watts’s thesis is that while hubs exist — some people are far more connected than others — they’re not as important in spreading ideas as once supposed.

See “Is the Tipping Point Toast?” in the February 2008 issue of Fast Company for more on Duncan Watts and his skepticism regarding the importance of hubs.

Literate programming and statistics

Tuesday, January 15th, 2008

Sweave, mentioned in my previous post, is a tool for literate programming. Donald Knuth invented literate programming and gives this description of the technique in his book of the same name:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

Knuth says the quality of his code went up dramatically when he started using literate programming. When he published the source code for TeX as a literate program and a book, he was so confident in the quality of the code that he offered cash rewards for bug reports, doubling the amount of the reward with each edition. In one edition, he goes so far as to say “I believe that the final bug in TeX was discovered and removed on November 27, 1985.” Even though TeX is a large program, this was not an idle boast. A few errors were discovered after 1985, but only after generations of Stanford students studied the source code carefully and multitudes of users around the world put TeX through its paces.

While literate programming is a fantastic idea, it has failed to gain a substantial following. And yet Sweave might catch on even though literate programming in general has not.

In most software development, documentation is an afterthought. When push comes to shove, developers are rewarded for putting buttons on a screen, not for writing documentation. Software documentation can be extremely valuable, but it’s most valuable to someone other than the author. And the benefit of the documentation may only be realized years after it was written.

But statisticians are rewarded for writing documents. In a statistical analysis, the document is the deliverable. The benefits of literate programming for a statistician are more personal and more immediate. Statistical analyses are often re-run, with just enough time between runs for the previous work to be completely flushed from short-term memory. Data is corrected or augmented, papers come back from review with requests for changes, etc. Statisticians have more self-interest in making their work reproducible than do programmers.

Patrick McPhee gives this analysis for why literate programming has not caught on.

Without wanting to be elitist, the thing that will prevent literate programming from becoming a mainstream method is that it requires thought and discipline. The mainstream is established by people who want fast results while using roughly the same methods that everyone else seems to be using, and literate programming is never going to have that kind of appeal. This doesn’t take away from its usefulness as an approach.

But statisticians are more free to make individual technology choices than programmers are. Programmers typically work in large teams and have to use the same tools as their colleagues. Statisticians often work alone. And since they deliver documents rather than code, statisticians are free to use Sweave without their colleagues’ knowledge or consent. I doubt whether a large portion of statisticians will ever be attracted to literate programming, but technological minorities can thrive more easily in statistics than in mainstream software development.

Irreproducible analysis

Tuesday, January 15th, 2008

Journals and granting agencies are prodding scientists to make their data public. Once the data is public, other scientists can verify the conclusions. Or at least that’s how it’s supposed to work. In practice, it can be extremely difficult or impossible to reproduce someone else’s results. I’m not talking here about reproducing experiments, but simply reproducing the statistical analysis of experiments.

It’s understandable that many experiments are not practical to reproduce: the replicator needs the same resources as the original experimenter, and so expensive experiments are seldom reproduced. But in principle the analysis of an experiment’s data should be repeatable by anyone with a computer. And yet this is very often not possible.

Published analyses of complex data sets, such as microarray experiments, are seldom exactly reproducible. Authors inevitably leave out some detail of how they got their numbers. In a complex analysis, it’s difficult to remember everything that was done. And even if authors were meticulous about documenting every step of the analysis, journals do not want to publish such great detail. Often an article provides enough clues that a persistent statistician can approximately reproduce the conclusions. But sometimes the analysis is opaque or just plain wrong.

I attended a talk yesterday where Keith Baggerly explained the extraordinary steps he and his colleagues went through in an attempt to reproduce the results in a medical article published last year by Potti et al. He called this process “forensic bioinformatics,” attempting to reconstruct the process that led to the published conclusions. He showed how he could reproduce parts of the results in the article in question by, among other things, reversing the labels on some of the groups. (For details, see “Microarrays: retracing steps” by Kevin Coombes, Jing Wang, and Keith Baggerly in Nature Medicine, November 2007, pp 1276-1277.)

While they were able to reverse-engineer many of the mistakes in the paper, some remain a mystery. In any case, they claim that the results of the paper are just wrong. They conclude “The idea … is exciting. Our analysis, however, suggests that it did not work here.”

The authors of the original article replied that there were a few errors but that these have been fixed and they didn’t affect the conclusions anyway. Baggerly and his colleagues disagree. So is this just a standoff with two sides pointing fingers at each other saying the other guys are wrong? No. There’s an important asymmetry between the two sides: the original analysis is opaque but the critical analysis is transparent. Baggerly and company have written code to carry out every tiny step of their analysis and made the Sweave code available for anyone to download. In other words, they didn’t just publish their paper, they published code to write their paper.

Sweave is a program that lets authors mix prose (LaTeX) with code (R) in a single file. Users do not directly paste numbers and graphs into a paper. Instead, they embed the code to produce the numbers and graphs, and Sweave replaces the code with the results of running the code. (Sweave embeds R inside LaTeX the way CGI embeds Perl inside HTML.) Sweave doesn’t guarantee reproducibility, but it is a first step.
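The core idea, replacing embedded code with the result of running it, can be imitated in a few lines of Python. This is only an analogy: the `<<...>>` placeholder syntax and the `weave` function are invented for this sketch and are not Sweave's actual chunk syntax, and real Sweave runs R inside LaTeX.

```python
import re


def weave(template, namespace):
    """Replace each <<expression>> in the template with its evaluated result."""
    def run_chunk(match):
        # Evaluate the embedded expression against the supplied data.
        # eval is acceptable for a toy demo but not for untrusted input.
        return str(eval(match.group(1), {}, namespace))
    return re.sub(r"<<(.+?)>>", run_chunk, template)


data = [2.1, 3.4, 2.9, 3.8]
template = "We analyzed <<len(data)>> samples; the largest was <<max(data)>>."
print(weave(template, {"data": data}))
```

If the data changes, rerunning the script updates every number in the text, so nothing in the report is pasted in by hand.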

Tips for learning regular expressions

Monday, January 14th, 2008

Here are a few realizations that helped me the most when I was learning regular expressions.

1. Regular expressions aren’t trivial. If you think they’re trivial, but you can’t get them to work, then you feel stupid. They’re not trivial, but they’re not that hard either. They just take some study.

2. Regular expressions are not command line wild cards. They contain some of the same symbols but they don’t mean the same thing. They’re just similar enough to cause confusion.

3. Regular expressions are a little programming language. Regular expressions are usually contained inside another programming language, like JavaScript or PowerShell. Think of the expressions as little bits of a foreign language, like a French quotation inside English prose. Don’t expect rules from the outside language to have any relation to the rules inside, no more than you’d expect English grammar to apply inside that French quote.

4. Character classes are a little sub-language within regular expressions. Character classes are their own little world. Once you realize that and don’t expect the usual rules for regular expressions outside character classes to apply, you can see that they’re not very complicated, just different. Failure to realize that they are different is a major source of bugs.
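Points 2 and 4 are easy to see in a short experiment. The sketch below uses Python's `fnmatch` and `re` modules. The file names and patterns are made up for the demonstration.

```python
import fnmatch
import re

names = ["a.txt", "ab.txt", "atxt", "b.txt"]

# As a shell-style wildcard, * means "any run of characters" and . is literal.
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, "a*.txt")]

# As a regular expression, a* means "zero or more a's" and . is any character.
regex_hits = [n for n in names if re.fullmatch(r"a*.txt", n)]

print(wildcard_hits)
print(regex_hits)

# Inside a character class, . * and + are ordinary characters...
print(re.findall(r"[.*+]", "1.5*2+3"))

# ...and ^ means "not" rather than "start of string".
print(re.findall(r"[^0-9]", "a1b2"))
print(re.findall(r"^[0-9]", "1a2"))
```

The same pattern text selects different files under the two sets of rules, and the same symbols change meaning again once they are inside square brackets.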

Once you’re ready to dive into regular expressions, read Jeffrey Friedl’s book. It’s by far the best book on the subject. Read the first few chapters carefully, but then flip the pages quickly when he goes off into NFA engines and all that.

Irrelevant uncertainty

Monday, January 14th, 2008

Suppose I asked where you want to eat lunch. Then I told you I was about to flip a coin and asked again where you want to eat lunch. Would your answer change? Probably not, but sometimes the introduction of irrelevant uncertainty does change our behavior.

In a dose-finding trial, it is often the case that a particular observation has no immediate importance to decision making. Suppose Mr. Smith’s outcome is unknown. We calculate what the next dose will be if he responds to treatment and what it will be if he does not respond. If both doses are the same, why wait to know his outcome before continuing? Some people accept this reasoning immediately, while others are quite resistant.
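The reasoning can be written down as a short check: compute the next dose under each possible outcome and see whether the two agree. The dosing rule below is a toy invented for illustration, not a real trial design. Only the "try both outcomes and compare" step reflects the argument above.

```python
def toy_next_dose(responses, current_dose):
    """Hypothetical rule: stay at this dose once 3 patients respond, else go up."""
    return current_dose if sum(responses) >= 3 else current_dose + 1


def dose_if_outcome_irrelevant(rule, responses, current_dose):
    """Return the next dose if it is the same whether or not the pending
    patient responds; return None if the decision depends on the outcome."""
    dose_if_response = rule(responses + [True], current_dose)
    dose_if_no_response = rule(responses + [False], current_dose)
    if dose_if_response == dose_if_no_response:
        return dose_if_response
    return None


# Three responders already: the pending outcome cannot change the decision.
print(dose_if_outcome_irrelevant(toy_next_dose, [True, True, True], 1))
# Two responders: the pending outcome is decisive, so we have to wait.
print(dose_if_outcome_irrelevant(toy_next_dose, [True, True], 1))
```

When the function returns a dose, waiting for Mr. Smith’s outcome adds delay without adding information relevant to the decision.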

Not only may a patient’s outcome be irrelevant, the outcome of an entire clinical trial may be irrelevant. I heard of a conversation with a drug company where a consultant asked what the company would do if their trial were successful. He then asked what they would do if it were not successful. Both answers were the same. He then asked why do the trial at all, but his question fell on deaf ears.

While it is irrational to wait to resolve irrelevant uncertainty, it is a human tendency. For example, businesses may delay a decision on some action pending the outcome of a presidential election, even if they would take the same action regardless of which candidate won. I see how silly this is when other people do it, but it’s not too hard for me to think of analogous situations where I act the same way.