Some preliminary results from my final project for Applied Spatial Statistics, taught by Brian Reich.

Starting with the point-referenced data from each of the 122 questions in the Harvard Dialect Survey, by Bert Vaux and Scott Golder, we used a *k*-nearest neighbor smoothing algorithm to estimate the probability of seeing a particular answer—e.g., whether a person would say *soda*, *pop*, or *coke*—at every point in the continental US.
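The smoothing step can be sketched roughly as follows. The project itself was written in R, but here is a minimal Python illustration of the idea: at each grid point, estimate the answer probabilities as the share of each answer among the *k* nearest survey respondents. The function name, the toy coordinates, and the answers are all hypothetical, not taken from the survey data.

```python
import numpy as np

def knn_answer_probs(resp_coords, resp_answers, grid_point, k=5):
    """Estimate answer probabilities at grid_point as the fraction of each
    answer among the k nearest survey respondents (illustrative sketch;
    the actual project used R)."""
    # Euclidean distance from every respondent to the query point
    d = np.linalg.norm(resp_coords - grid_point, axis=1)
    nearest = np.argsort(d)[:k]
    answers, counts = np.unique(resp_answers[nearest], return_counts=True)
    return dict(zip(answers, counts / k))

# Toy data: respondent locations (lon, lat) and their "soda/pop/coke" answers
coords = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [0.2, 0.0]])
answers = np.array(["soda", "soda", "coke", "coke", "pop"])

probs = knn_answer_probs(coords, answers, np.array([0.0, 0.1]), k=3)
# The three nearest respondents said "soda", "soda", "pop",
# so probs is {"pop": 1/3, "soda": 2/3}
```

In practice one would use great-circle rather than Euclidean distances and a much finer grid, but the estimator is the same.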

For a particular question, we can quantify the difference in dialect between two locations as one minus the overlap between their answer-probability distributions, summed across answer categories. Summing these per-question differences then gives a rough measure of the aggregate dialect difference, which is plotted in the map at right.
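The distance described above can be written compactly: for answer probabilities \(p\) and \(q\) at two locations, the per-question difference is \(1 - \sum_c \min(p_c, q_c)\) (the total variation distance), and the aggregate difference sums this over questions. A small Python sketch, with hypothetical function names and made-up toy probabilities:

```python
def question_difference(p, q):
    """Per-question dialect difference: one minus the overlap of the two
    answer-probability distributions, i.e. 1 - sum_c min(p_c, q_c)."""
    cats = set(p) | set(q)
    return 1.0 - sum(min(p.get(c, 0.0), q.get(c, 0.0)) for c in cats)

def aggregate_difference(probs_a, probs_b):
    """Aggregate dialect difference: sum of per-question differences."""
    return sum(question_difference(probs_a[q], probs_b[q]) for q in probs_a)

# Toy example with two questions (values invented for illustration)
city_a = {"q1": {"soda": 0.7, "pop": 0.2, "coke": 0.1},
          "q2": {"y'all": 0.3, "you guys": 0.7}}
city_b = {"q1": {"soda": 0.1, "pop": 0.3, "coke": 0.6},
          "q2": {"y'all": 0.8, "you guys": 0.2}}

# q1 overlap = 0.1 + 0.2 + 0.1 = 0.4, so its difference is 0.6;
# q2 overlap = 0.3 + 0.2 = 0.5, so its difference is 0.5.
total = aggregate_difference(city_a, city_b)  # 0.6 + 0.5 = 1.1
```

Identical distributions give a difference of 0, and fully disjoint ones give 1, so the aggregate ranges from 0 to the number of questions.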

**Note: The “most similar” and “least similar” cities are limited to those with a population of at least 200,000.** (City data from the R `maps` package.) Other dialect maps and further details regarding the model's construction can be found in the accompanying poster.

All coding was done in R / Shiny.