Data Wrangling In Turkish: More than Meets the Dotted İ
Kevin Cole
The resulting map detailing neighborhoods which have been closed to new residence registration by foreigners in Turkey.


A few weeks ago, the Turkish Presidency for Migration Management (PMM) announced that more than 1,100 neighborhoods across the country would be closed to residence permit registration and renewal for foreign nationals, publishing an Excel spreadsheet of the affected neighborhoods alongside the announcement. Wanting to get a better understanding of the geographic concentration and breadth of this policy, I knew that a map visualizing those closed neighborhoods would offer a much more accessible look at the data.

This meant:

  1. Extracting a list of affected neighborhoods from the source data from the PMM

  2. Finding a suitable data set which includes neighborhood names for all of Turkey and their respective geometries (in other words, their geographical boundaries)

  3. Matching the PMM-derived list of neighborhoods with those in the geographical data set

Reviewing PMM Source Data

The data published by the PMM is a 1,169 row Excel spreadsheet listing the neighborhoods and, in some cases, subdivisions affected by this policy. Each row contains the name of the province (il), the subprovince (ilçe), and finally the neighborhood (mahalle).

Finding Geographic Data on Turkish Neighborhoods

This entire endeavor would have been impossible if not for the extremely helpful work of others, including the collection “Turkey GeoJSON” by Çağrı Özkurt. This scraped data set included the geometries and names of nearly 50,000 individual neighborhoods across Turkey: over 300 MB of pure text data!

Even loading the file into geospatial analysis software like QGIS took around 3-4 minutes on my laptop. So while I had found the data set I needed to match against, my first real challenge was to whittle down this set of neighborhoods into a smaller subset of match candidates.

Luckily, it was possible to extract a list of subprovinces in Turkey which contained closed neighborhoods from the PMM source data. However, the subprovinces there were listed by name rather than a unique and easy-to-reference identifier. Still, it was easy enough to match these across both data sets, with only minor manual adjustments necessary to account for a few instances where the PMM data’s subprovince name repeated the province as a form of disambiguation (e.g., “Yenişehir / Bursa” and “Yenişehir / Mersin”).
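In my case these adjustments were made by hand, but the pattern is simple enough to automate. A minimal sketch (in Python rather than the R I used for this project; the separator handling is an assumption based on the “Yenişehir / Bursa” example above):

```python
def normalize_district(name: str) -> str:
    """Strip a province disambiguator from a PMM district name.

    A few PMM rows repeat the province after the district,
    e.g. "Yenişehir / Bursa" vs. "Yenişehir / Mersin"; for matching
    against the geographic data we only want the district itself.
    """
    return name.split("/")[0].strip()
```

So `normalize_district("Yenişehir / Bursa")` yields `"Yenişehir"`, while names without a disambiguator pass through unchanged.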

By extracting only those neighborhoods in subprovinces that were known to contain closed neighborhoods, it was possible to reduce the number of potential match candidates from 49,597 to 15,429 possibilities.
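That filtering step amounts to keeping only the GeoJSON features whose province/district pair appears in the PMM-derived list. A Python sketch of the idea (the property keys `"province"` and `"district"` are hypothetical — the actual “Turkey GeoJSON” collection may name them differently):

```python
def filter_candidates(features, closed_districts):
    """Keep only neighborhood features whose (province, district)
    pair appears in the set derived from the PMM spreadsheet.

    `features` is a list of GeoJSON Feature dicts;
    `closed_districts` is a set of (province, district) tuples.
    """
    return [
        feat for feat in features
        if (feat["properties"]["province"],
            feat["properties"]["district"]) in closed_districts
    ]
```

Because set membership tests are constant-time, this pass over ~50,000 features is cheap compared to the fuzzy matching that follows.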

From here, it was time to look into how to match the PMM’s list of neighborhood names to the names of the neighborhoods as recorded in the GeoJSON collection.

Fuzzy Matching

While I had been familiar with the term, this project offered me a first opportunity to put ‘fuzzy matching’ into practice: the process of matching entries across two different data sources without requiring them to be identical. It’s an extremely valuable capability when you consider how readily the human mind understands that essentially the same object or entity can go by many different names, along with the human propensity for misspellings, non-standard abbreviations, and so on.

There are a few different approaches to fuzzy matching, and the Turkish script poses its own challenges to implementing some of them.

One concrete challenge to reckon with is computational burden. Most fuzzy matching algorithms work by taking a string of text in the source data set and comparing that string to each candidate string in the comparison data set. Each of these computations generates a similarity score representing the degree to which the two strings match, and then the best match or any match above a specified threshold level is saved.
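That compare-everything-and-keep-the-best loop can be sketched in a few lines (Python; `difflib`’s ratio is used here as a stand-in scorer for illustration — the project itself used Jaro-Winkler):

```python
from difflib import SequenceMatcher  # stand-in scorer for illustration


def best_match(source, candidates, threshold=0.85):
    """Score `source` against every candidate and return the best
    match, but only if its similarity clears the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, source, cand).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

Each call scores every candidate, which is exactly why the candidate pool matters so much: the work grows with (source entries × candidates).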

When working with a data set which contains thousands of source entities and hundreds of thousands or even millions of match candidates, the number of discrete computations necessary can grow to extremes that make working with this approach at scale infeasible.

So, while my data set was only moderate in comparison, it was still more efficient to reduce the number of potential match candidates as much as possible before running fuzzy matching functions in order to minimize the computational overhead and thus the time I spent waiting in front of my keyboard.

I ended up using the popular Jaro-Winkler distance algorithm, which scores two strings based on the number and ordering of the characters they share, giving extra weight to a common prefix, thereby producing a metric representative of the similarity between two strings.
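For reference, the metric can be computed from scratch in relatively few lines (a plain-Python sketch; in practice you would reach for a library implementation such as R’s stringdist or Python’s rapidfuzz):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching characters within a sliding window,
    penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters out of order.
    transpositions, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a shared prefix
    (up to 4 characters), with scaling factor p."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

On the textbook example pair "MARTHA"/"MARHTA", this yields roughly 0.961: the strings share all six characters, with one transposition and a three-character common prefix.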

However, before I could effectively apply the Jaro-Winkler distance function, one other helpful step was necessary.

Dotting (and Undotting) Your I’s

The Turkish alphabet contains a number of letters beyond the basic Latin set, such as Ş/ş, Ç/ç, Ö/ö, Ü/ü, Ğ/ğ, I/ı and İ/i. These characters can present a real headache when working with machine-based matching, as, for example, the uppercase dotless I will not register as a match for the uppercase dotted İ. So, to reduce the likelihood of missing matches due to typographical errors, it was helpful to standardize all location name data.
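A quick Python illustration of the problem (this is standard Unicode behavior, not specific to any library):

```python
# The dotted İ (U+0130) and plain ASCII I (U+0049) are distinct code
# points, so a codepoint-level comparison fails:
assert "İSTANBUL" != "ISTANBUL"

# Naive lowercasing doesn't rescue you either: Python's str.lower()
# turns İ into "i" followed by a combining dot above (U+0307),
# which is two code points long and still won't equal a plain "i".
assert len("İ".lower()) == 2
assert "İ".lower() != "i"
```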

There is a helpful function in the stringi R package for string processing called stri_trans_general, which takes a string and the argument id = "Latin-ASCII" and returns a transliterated string replacing those special characters with their closest equivalents in the restricted ASCII character set. For example, AŞAĞIÇAMURCU KÖYÜ, GİLGİL MEZRASI becomes ASAGICAMURCU KOYU, GILGIL MEZRASI.
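A rough Python equivalent is an explicit character map — necessary because the dotless ı has no Unicode decomposition to strip a diacritic from. The mapping below covers only the Turkish-specific letters, a small subset of what the full Latin-ASCII transliterator handles:

```python
# Map the Turkish-specific letters to their closest ASCII equivalents,
# mimicking the Latin-ASCII transliterator for this subset.
TR_TO_ASCII = str.maketrans("ŞşÇçÖöÜüĞğİı", "SsCcOoUuGgIi")


def to_ascii(text: str) -> str:
    """Transliterate Turkish-specific characters to plain ASCII."""
    return text.translate(TR_TO_ASCII)
```

Applied to the example above, `to_ascii("AŞAĞIÇAMURCU KÖYÜ, GİLGİL MEZRASI")` gives `"ASAGICAMURCU KOYU, GILGIL MEZRASI"`.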

Once I had implemented these small changes and done some minor data preparation (like changing all instances of “MAH.” to “MH.” for the abbreviation of mahalle), it was fairly straightforward to first find perfectly matched records via a table join, and then to match the remaining records with the Jaro-Winkler algorithm.
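The two-stage flow — an exact join first, fuzzy matching only for the leftovers — can be sketched as follows (Python, with `difflib` standing in for Jaro-Winkler; the function and variable names are hypothetical):

```python
from difflib import SequenceMatcher  # stand-in for Jaro-Winkler


def two_stage_match(pmm_names, geo_names, threshold=0.9):
    """Match PMM neighborhood names to geographic data set names:
    exact matches first (the table join), then a fuzzy pass for
    whatever remains. Unmatched names map to None."""
    geo_set = set(geo_names)
    matched = {}
    for name in pmm_names:
        if name in geo_set:  # stage 1: exact join
            matched[name] = name
            continue
        # stage 2: fuzzy match against all candidates
        scored = [(SequenceMatcher(None, name, g).ratio(), g)
                  for g in geo_names]
        score, best = max(scored)
        matched[name] = best if score >= threshold else None
    return matched
```

Running the exact join first matters for performance: every name it catches is one fewer source entity that has to be scored against the full candidate pool.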

In the end, there were only 5-6 records that could not be matched this way, most of them neighborhoods that had been renamed in the last year or two and whose names had therefore not yet been updated in the geospatial data set.

I used QGIS to join together the geospatial data with some additional data from the PMM document, and then used an online mapping service to host and visualize the data.

Sharing the Results

I found it easiest to share my results by creating an ObservableHQ notebook. Here’s an embedded version of the map:

After I’ve had a bit more time to tidy up the R code, I also plan to share an interactive notebook documenting the process and allowing others to reproduce and improve on my admittedly unoptimized approach. For now, it has been an interesting and rewarding challenge that presented the opportunity to learn some new skills and put others to practical use.
