Monday, March 8, 2010

The Data of the World

Let's mix some work with play.

The statistician sits at his computer, staring into the dizzying array of scatterplots that tell him everything — almost everything. The interactions, the tests, the higher factorial orders of effects — these will reveal themselves later, all in due time. But these plots! The information they give, in wild matrices of dots seemingly thrown haphazardly across the screen like snowflakes in the night — some aligning into curves, some refusing to conform to any kind of humanly-discernible shape — is beyond the comprehension of mere mortals. But not for him.

The statistician sees the patterns that inform the code he’s typing now. Line by line, the instructions that will make everything clear appear dashed in black and white across an incredibly 90s-looking interface, a boring GUI that hasn’t been updated in years, but he doesn’t need it to be. He doesn’t need anything flashy or drag-and-drop. He only needs the numbers that will come up, the P-values and the tables of errors and sums of squares that will decide the fate of the data.

On one side of the monitor are the dots. On the other side is the code. The tension, the fiery conversation between these two things, is clear right now only in the statistician’s mind. Like a fight between the powers of Good and Evil: one is trying to hide the Truth, throw order into more confusion and disorder, while the other tries to fight back the darkness with a dimly-bulbed flashlight.

Each dot represents something — a person with or without syphilis, an automobile that breaks down or not, a cell that may have divided, a protein, a lab rat, an irate chimpanzee wearing lipstick, something — but that’s not what anyone else would see. What anyone else would see is just chaos. And the data? In its raw form? The data! Imagine a spreadsheet from hell. Imagine hundreds, maybe thousands of lines of numbers out to crazy decimals, with esoteric columns that not even the experimenter knows what to do with. It’s randomness, chaos, meaningless junk from which nothing important can be grasped.

But the statistician is here to change that. It’s an Olympian task. The data fights you. It’s a Sisyphean burden. It’s like trying to pull a sensical story from the Library of Babel. Like drawing a symphony out of piano notes played by a monkey with one hand as he composes Shakespeare with the other. But you don’t have an infinite number of monkeys. You have N monkeys. N monkeys at N typewriters and what are your chances of Romeo & Juliet? Crunch the numbers. Do the math. It isn’t looking good.

And what if the experimenter screwed up somewhere? What if one monkey threw feces at another? The mega-outliers, the influential points that sit at the edge of the graphs, like angry, delicate snowflakes that threaten to end the world in ice, and you have to make the choice: stamp them out, or does Romeo really marry Juliet in a frostbitten outhouse? What if he does? This is a documentary and you can’t get it wrong.

Then: the code is done. The enter key is waiting. The statistician’s eyes glaze over as he looks at the data points one last time, hoping, praying. His index finger is poised, lingers in the air for N factorial seconds, an infinity in monkey time, comes down.

And the CLACK you hear is the sound of order from chaos.