Some preliminary results from my final project for Applied Spatial Statistics, taught by Brian Reich.
Starting with the point-referenced data from each of the 122 questions in the Harvard Dialect Survey, by Bert Vaux and Scott Golder, we used a k-nearest-neighbor smoothing algorithm to estimate the probability of seeing a particular answer (e.g., whether a person would say soda, pop, or coke) at every point in the continental US.
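As a rough illustration of the smoothing idea, here is a minimal Python sketch (the project itself was written in R, and this is not its code). It assumes the simplest version of the method: the probability of each answer at a location is the unweighted share of that answer among the k nearest respondents, with distance measured along the great circle. The function names, the toy respondents, and the choice of k are all made up for the example; the real model may weight neighbors or choose k differently.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def knn_answer_probs(respondents, lat, lon, k):
    """Estimate answer probabilities at (lat, lon).

    respondents is a list of (lat, lon, answer) tuples. The estimate is the
    unweighted share of each answer among the k nearest respondents
    (an assumed, simplified form of k-nearest-neighbor smoothing).
    """
    nearest = sorted(
        respondents,
        key=lambda r: haversine_km(lat, lon, r[0], r[1]),
    )[:k]
    counts = Counter(answer for _, _, answer in nearest)
    return {answer: n / len(nearest) for answer, n in counts.items()}


# Hypothetical respondents for one question (soft drink term).
respondents = [
    (35.78, -78.64, "soda"),
    (35.99, -78.90, "soda"),
    (36.07, -79.79, "coke"),
    (33.75, -84.39, "coke"),
    (41.88, -87.63, "pop"),
]

probs = knn_answer_probs(respondents, 35.90, -78.80, k=3)
print(probs)
```

Repeating this estimate over a grid of points gives a probability surface for each answer to a question.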
For a particular question, we can quantify the difference in dialect between two locations as one minus the overlap between their estimated answer probabilities across the answer categories. Summing these per-question differences over all questions then gives a rough measure of the aggregate dialect difference, which is plotted in the map at right.
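The difference measure can be sketched in a few lines of Python. This assumes one plausible reading of "overlap": for each answer category, take the smaller of the two locations' probabilities, then sum over categories. That reading, the function names, and the toy probabilities below are assumptions for illustration, not details confirmed by the project.

```python
def question_difference(p, q):
    """One minus the overlap between two answer distributions.

    p and q map answer -> probability for a single question. Overlap is
    taken here as the sum over answers of min(p, q) (an assumed definition).
    """
    answers = set(p) | set(q)
    overlap = sum(min(p.get(a, 0.0), q.get(a, 0.0)) for a in answers)
    return 1.0 - overlap


def aggregate_difference(probs_a, probs_b):
    """Sum the per-question differences over all questions in probs_a."""
    return sum(
        question_difference(probs_a[qid], probs_b[qid]) for qid in probs_a
    )


# Hypothetical smoothed probabilities at two locations for two questions.
city_a = {
    "soft_drink": {"soda": 0.7, "pop": 0.2, "coke": 0.1},
    "second_person_plural": {"you guys": 0.6, "y'all": 0.4},
}
city_b = {
    "soft_drink": {"soda": 0.1, "pop": 0.1, "coke": 0.8},
    "second_person_plural": {"you guys": 0.2, "y'all": 0.8},
}

total = aggregate_difference(city_a, city_b)
print(round(total, 3))
```

Under this definition each per-question difference lies between 0 (identical distributions) and 1 (no shared answers), so the aggregate ranges from 0 to the number of questions.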
Note: The “most similar” and “least similar” cities are limited to those with a population of at least 200,000. (City data from the maps package for R.) Other dialect maps and further details regarding the model's construction can be found in the accompanying poster.
All coding was done in R / Shiny.
INDIVIDUAL QUESTION MAPS