Below is an excerpt from Wired magazine; see also this earlier entry on the ‘file drawer problem’, which discusses the same issue:
In 1981, the New England Journal of Medicine published a Harvard study that showed an unexpected link between drinking coffee and pancreatic cancer. As it happened, researchers were anticipating a connection between alcohol or tobacco and cancer. But according to the survey of several hundred patients, booze and cigarettes didn’t seem to increase your risk. Then came a surprise: An incidental survey question suggested that coffee did increase the chances of pancreatic cancer. So that’s what got published.
Those positive results, alas, were entirely anomalous; 20 years of follow-up research showed the coffee-cancer connection to be bunk. Nonetheless, it’s a textbook example of so-called publication bias, where science gets skewed because only positive correlations see the light of day. After all, surprising findings are what make the news (and careers).
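To make the file-drawer mechanism concrete, here is a minimal Python sketch (not from the Wired piece; the sample sizes and the p < 0.05 threshold are purely illustrative) simulating a literature in which only statistically significant results get published. The true effect is exactly zero, yet every study that survives the significance filter reports a sizable one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_STUDIES = 1000  # hypothetical studies, all of a truly null effect
N_PER_ARM = 50    # subjects per group in each study (illustrative)

published = []
for _ in range(N_STUDIES):
    # True effect is zero: both groups come from the same distribution.
    control = rng.normal(0.0, 1.0, N_PER_ARM)
    treated = rng.normal(0.0, 1.0, N_PER_ARM)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:  # only "positive" (significant) results see the light of day
        published.append(treated.mean() - control.mean())

print(f"studies run:       {N_STUDIES}")
print(f"studies published: {len(published)} (all false positives)")
print(f"mean |effect| in the published record: {np.mean(np.abs(published)):.2f}"
      f" (true effect: 0.00)")
```

Roughly 5% of the null studies clear the filter by chance, and their average absolute effect is far from zero. That is the skew the coffee result exemplifies: the published record looks like evidence even when the underlying truth is nothing.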
So what happens to all the research that doesn’t yield a dramatic outcome — or, worse, the opposite of what researchers had hoped? It ends up stuffed in some lab drawer. The result is a vast body of squandered knowledge that represents a waste of resources and a drag on scientific progress. This information — call it dark data — must be set free.
For the past couple of years, there’s been much talk about open access, the idea that more scientific publications should be freely available — not locked behind paywalls and subscriptions. Thanks to the Public Library of Science (PLoS) and other organizations, that notion is making headway. Liberating dark data takes this ethos one step further. It also makes many scientists deeply uncomfortable, because it calls for them to reveal their “failures.” But in this data-intensive age, those apparent dead ends could be more important than the breakthroughs. After all, some of today’s most compelling research efforts aren’t one-off studies that eke out statistically significant results; they’re meta-studies — studies of studies — that crunch data from dozens of sources, producing results that are much more likely to be true. What’s more, your dead end may be another scientist’s missing link, the elusive chunk of data they needed. Freeing up dark data could represent one of the biggest boons to research in decades, fueling advances in genetics, neuroscience, and biotech.
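The meta-study point can be made in a few lines as well. The sketch below is again illustrative: the per-study estimates and standard errors are invented, and fixed-effect inverse-variance weighting is just one standard pooling scheme. It shows why combining many small studies yields a tighter estimate than any single one:

```python
import numpy as np

# Hypothetical (estimate, standard error) pairs for five small studies of
# the same effect; the numbers are invented purely for illustration.
studies = [(0.30, 0.20), (0.10, 0.15), (0.25, 0.25), (0.05, 0.18), (0.20, 0.22)]

# Fixed-effect inverse-variance pooling: weight each study by 1 / SE^2,
# so that more precise studies count for more.
estimates = np.array([est for est, _ in studies])
weights = np.array([1.0 / se**2 for _, se in studies])

pooled = (weights * estimates).sum() / weights.sum()
pooled_se = np.sqrt(1.0 / weights.sum())

print(f"pooled estimate: {pooled:.3f}")
print(f"pooled standard error: {pooled_se:.3f} "
      f"(vs. {min(se for _, se in studies):.3f} for the best single study)")
```

Note that the pooling only works if the unimpressive studies are available to pool in the first place, which is exactly the article’s case for releasing dark data rather than leaving it in the drawer.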
So why doesn’t it happen? In part, it’s a logistics problem: Advocating the release of dark data is one thing, but it’s quite another to actually collect it, juggling different formats and standards. And, of course, there’s the issue of storage. These days, an astronomical study of quasars or an ambitious bioinformatics project can generate several terabytes of data. Few have the capacity to store that, let alone analyze it.
Google, among others, is lending a hand with its Palimpsest project, offering to store and share monster-size data sets (making the data searchable isn’t a part of the effort). As storage costs drop, similar data banks will emerge, along with format standards, and it should become ever easier to share results, good or bad.
Technology is actually the simple part. The tougher problem lies in the culture of science. More and more, research is funded by commercial entities, which deem any results proprietary. And even among fair-minded academics, the pressures of time, tender, and tenure can make openness an afterthought. If their research is successful, many academics guard their data like Gollum, wringing every publication opportunity they can out of it over the years. If the research doesn’t pan out, there’s a strong incentive to move on, ASAP, and a disincentive to linger in eddies that may not advance one’s job prospects.