A few years back, someone asked me: “What is your favorite insight gleaned from data?”
There are many wonderful examples to choose from. My favorite is John Snow’s demonstration that cholera is transmitted by water rather than by air.
Over a few weeks in the summer of 1854, hundreds of people living near Broad and Cambridge Streets in London suddenly developed cholera-like symptoms and died. A popular theory at the time was that cholera was spread through the air. However, Snow, who had been studying other cholera epidemics, hypothesized that cholera was spread through water.
To examine his hypothesis, he turned his attention to the local water supply, especially the Broad Street pump. Snow was unable to provide direct proof that the pump was contaminated. However, he cleverly exploited historical controls, took water samples from multiple pumps, and – most famously – visually mapped confirmed cases of cholera. His data and reasoning showed that the Broad Street pump was by far the most likely explanation for the local cholera outbreak.
Snow used the data to persuade local authorities to remove the pump’s handle to help prevent further infections. Cholera deaths in the same area soon declined. To his credit, Snow was the first to caution that the handle’s removal may have had little to do with the observed decline in deaths, as the death rate had already begun to decrease steadily beforehand.
Although this example comes from well before the electronic age, it foreshadows many familiar aspects of modern data science: an important motivating problem; a good, testable hypothesis; evaluation of evidence for and against counterfactuals; messy data; multiple analytic methods; informative data visualization; and a compelling storyline.