Exploratory Spatial Data Analysis: Emerging tools and concepts

	This article originally appeared in Geospatial Solutions Magazine's Net Results column of July 1, 2003. Other Net Results articles about the role of emerging technologies in the exchange of spatial information are also online.

1. Introduction and Glossary 2. Naive to cynical 3. Cynical to critical 4. Pure Critics

Critical spatio-statistical thinkers

The statisticians agree that space matters, but they aren’t sure how much. Taking the opposite tack from Dr. Snow in his time of cholera, Stanford doctoral student and NIH grantee Michael Choy is using an already well-documented association between demographic profile and stomach cancer to test the validity of a spatial-statistical aggregation technique. Unlike London’s localized cholera epidemic, stomach cancer is relatively rare and spread thinly across the US population. In all of 2000, for instance, the incidence of stomach cancer was approximately 5 per 1,000 people in the greater San Francisco Bay Area (www.nccc.org/pdf/Registries/annual_reports/incidence/stomach1.pdf), compared to the 1854 cholera outbreak’s dramatic 32 deaths per 1,000 people in one London neighborhood in less than two months. This makes the spatial distribution of stomach cancer a more elusive subject.

Numerous epidemiological studies have established a stomach bacterium, Helicobacter pylori, as the etiologic agent of stomach cancer. Infection with H. pylori is strongly correlated with a demographic profile including parameters of age, race/ethnicity, gender, income, place of birth (foreign versus native) and smoking behavior. The stomach cancer records also include a spatial reference: the census block group. Comparing the demographics of block groups containing stomach cancer cases should reveal the same strong correlation researchers have already derived independently of spatial information. However, because of the relative rarity of stomach cancer, many block groups have no stomach cancer cases. This eliminates their populations from the sample and increases the error associated with the estimate of the association. What to do?

Choy’s problem, measuring the margin of error in his own statistics, should be familiar to anyone interested in politics. Consider a political poll: 45 percent for Bush, 45 percent for Gore, with a margin of error of plus or minus 4 percentage points. Without the margin of error, Bush and Gore appear equally popular. But given the margin of error, either candidate could be as much as 8 percentage points ahead of or behind the other. While margins of error are routinely reported for polls, reliable error calculation in spatial methodologies remains a field of active research. So, coming up with an estimate of association is only half of the problem. It takes an accurate assessment of the error to address the question: "How good is our answer?"
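The poll arithmetic can be worked through in a few lines of Python. This is only a minimal sketch: it assumes a simple random sample and the usual normal approximation at 95 percent confidence, and the sample size of 600 is a hypothetical figure chosen because it produces a margin of roughly 4 points. It is not taken from any actual poll.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95 percent confidence level (an assumption here).
    """
    return z * math.sqrt(share * (1.0 - share) / sample_size)

def lead_range(share_a, share_b, moe):
    """Smallest and largest lead of A over B if each share may be off by moe."""
    lowest = (share_a - moe) - (share_b + moe)
    highest = (share_a + moe) - (share_b - moe)
    return lowest, highest

bush, gore = 0.45, 0.45

# Hypothetical sample size; about 600 respondents gives roughly +/- 4 points.
moe = margin_of_error(bush, sample_size=600)

# Use the poll's stated margin of 4 points for the lead calculation.
low, high = lead_range(bush, gore, 0.04)

print(f"computed margin of error: {moe:.3f}")
print(f"Bush's lead could range from {low:+.2f} to {high:+.2f}")
```

The two candidates' errors can run in opposite directions, which is why a 4-point margin on each share translates into a possible 8-point gap between them.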

The devil and the details. Choy’s strategy is to aggregate contiguous block groups, cancer-free or otherwise, based on their common demographic profiles. This approach increases the overall sample size and stability of the statistical correlation. Easy enough to conceptualize, but again, the devil is in the details. Choy wants to know how the results of his aggregations differ depending on the order in which he aggregates contiguous block groups, and on the degree of similarity required for aggregation. For both variables, he is working on computationally intensive methods that run through the spectrum of possibilities, remember each result, and return the entire range of results.
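To make the order and similarity questions concrete, here is a small Python sketch of one possible aggregation rule. It is not Choy's method, which the article does not specify. The greedy rule, the single-number demographic profile, the five toy block groups and the function names `aggregate` and `sweep` are all assumptions made for illustration.

```python
import random

def aggregate(profiles, neighbors, threshold, order):
    """Greedily group contiguous block groups with similar profiles.

    Block groups are visited in `order`. Each one joins the first adjacent
    region whose mean profile is within `threshold`; otherwise it starts a
    new region. (Hypothetical rule, for illustration only.)
    """
    region_of = {}  # block group id -> index into regions
    regions = []    # each region is a list of block group ids
    for bg in order:
        target = None
        for nb in neighbors[bg]:
            if nb in region_of:
                members = regions[region_of[nb]]
                mean = sum(profiles[m] for m in members) / len(members)
                if abs(profiles[bg] - mean) <= threshold:
                    target = region_of[nb]
                    break
        if target is None:
            regions.append([bg])
            region_of[bg] = len(regions) - 1
        else:
            regions[target].append(bg)
            region_of[bg] = target
    return regions

def sweep(profiles, neighbors, thresholds, n_orders, seed=0):
    """For each threshold, try many random orders and record the
    minimum and maximum number of regions produced."""
    rng = random.Random(seed)
    ids = sorted(profiles)
    results = {}
    for t in thresholds:
        counts = []
        for _ in range(n_orders):
            order = ids[:]
            rng.shuffle(order)
            counts.append(len(aggregate(profiles, neighbors, t, order)))
        results[t] = (min(counts), max(counts))
    return results

# Toy data: five block groups in a row, each with one demographic score.
profiles = {"A": 0.10, "B": 0.15, "C": 0.30, "D": 0.32, "E": 0.60}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
             "D": ["C", "E"], "E": ["D"]}

for t, (fewest, most) in sweep(profiles, neighbors, [0.0, 0.1, 1.0], 50).items():
    print(f"threshold {t}: between {fewest} and {most} regions")
```

Even on this toy example, the same similarity threshold can yield different numbers of regions depending on the visiting order, which is the kind of sensitivity that running through the whole range of possibilities is meant to expose.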

The vendor community is also pushing the ESDA envelope. Steve Kopp of ESRI, lead product specialist for Spatial Analyst, is expanding the ESDA tools in ArcGIS beyond those already in the Geostatistical Analyst extension to include investigating and quantifying relationships of point and polygon data, as well as exploring multitemporal, multiscenario ESDA techniques. Kopp described ESRI’s efforts as "ways to visualize and summarize a spectrum of analyses to detect trends or patterns," and hinted that some of the most interesting applications of ESDA are still evolving. "What best communicates the results of multiple simulations of a problem scenario?" asked Kopp. "You can’t just show the user 100 slightly different maps, or blend them into one summary map, for example. There needs to be useful visualization and exploration tools to make use of the results."

Maybe we should check with Joel Best, who must himself fall into the critical statistical thinking category. Is there a best practice for critically designed ESDA techniques? In Best’s own words, “No statistic is perfect, but some are less imperfect than others. Good or bad, every statistic reflects its creators’ choices.” So, if you don’t like the stats, go out and make some of your own.

References
Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists. Joel Best (2001, University of California Press).
How to Lie with Maps. Mark Monmonier (1991, University of Chicago Press).
Visual Explanations. Edward Tufte (1997, Graphics Press).
Interactive Spatial Data Analysis. Bailey, T. and A. Gatrell (1995, Longman).
Quantitative Geography: Perspectives on Spatial Data Analysis. Fotheringham, A.S., C. Brunsdon and M. Charlton (2000, Sage Publications).

