Kindle Optimized Turkish-to-English Dictionary
Some problems I run into won’t let my mind rest until I’ve found a way to solve them. It was both a blessing and a curse at my first office job where I’d often leave work only to continue researching and testing solutions to the issues puzzling me late into the night. The itch just demands to be scratched.
A month ago I decided to buy a Kindle in order to capitalize on my daily train commutes, using the 10 to 15 minute trips to slowly work through books that I’d otherwise be forced to sideline until the holidays. I was really happy with the results: whereas I was always too put off by the motions of taking a bulky book out of my bag, carefully monitoring my trip and preparing a bookmark before carefully stowing my book, with the Kindle I stopped scrolling through my phone on train rides and began to finish a few books instead. In fact, I had nothing but praise for the little device until I decided I’d like to read a Turkish novel.
Indeed, the Kindle is perfectly suited for reading in a language that you are still learning because any word can be highlighted with your fingertip, initiating a dictionary search that returns a definition/translation. Hell, I have even used the handy dictionary feature while reading in my native English and picking up new vocabulary like “garret: a top-floor attic room” and “shoat: a young pig.” Searched words are also automatically added to a “Vocabulary Builder,” allowing you to review them with a flashcard function. Unfortunately, the Kindle only comes packaged with dictionaries of the languages spoken where the Kindle Store is supported: Turkey is, for many reasons, not one of those countries and thus Turkish is not officially supported on the platform. I downloaded a few Turkish-English dictionaries available in the Kindle store and was disappointed to see that they were all more or less useless for an agglutinative language like Turkish.
The Turkish language works like an Erector Set: there are roots or stems to which suffixes are “bolted on,” furthering their meaning. For example: su means water, and is an uninflected root. The suffix -sUz works like the English suffix -less, thus susuz means waterless, dehydrated. But suffixes are also appended to roots to show their role in the sentence, which means that the words in a Turkish text rarely appear in their untouched, uninflected forms. In order for a dictionary to work effectively on the Kindle, the myriad permutations of possible inflections need to be linked to root entries. This is unfortunately not the case for the Turkish-English dictionaries available via the Amazon Kindle Store, and therefore these dictionaries almost inevitably show “No result found in dictionary” when used.
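The linking idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Kindle actually stores its index: the names `ENTRIES`, `build_index` and `look_up` are made up for the example, and "sudan" (su plus the suffix -dAn) is my own sample inflection.

```python
# Hypothetical sketch: link inflected surface forms back to root entries.
ENTRIES = {
    "su": {"definition": "water", "inflections": ["sudan"]},
    "tesir": {"definition": "effect", "inflections": ["tesirden"]},
}


def build_index(entries):
    """Map every headword and every listed inflection to its headword."""
    index = {}
    for headword, entry in entries.items():
        index[headword] = headword
        for form in entry["inflections"]:
            index[form] = headword
    return index


def look_up(word, entries, index):
    """Return 'headword: definition', or the Kindle-style failure message."""
    headword = index.get(word)
    if headword is None:
        return "No result found in dictionary"
    return f"{headword}: {entries[headword]['definition']}"


INDEX = build_index(ENTRIES)
print(look_up("tesirden", ENTRIES, INDEX))
print(look_up("tesirler", ENTRIES, INDEX))
```

A dictionary without the inflection list behaves like the second call: any form that is not a bare headword falls through to the failure message.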
The problem was now apparent and all that was left was for the itch to become so unbearable that I’d be compelled to spend enough time to come up with a solution. This lazy Sunday was the perfect excuse.
So, I needed to either find a dictionary file that contained inflections and then figure out how to convert it into a Kindle-friendly format, or I needed to somehow match roots to an extant Turkish-English dictionary and thus create my own file. Luckily, I was able to find a massive corpus with inflections: the Babylon Turkish-English dictionary boasting almost 169,000 definitions. The file itself is available for download as a 7.5 MB .BGL file, unfortunately a proprietary dictionary filetype. Of course, finding this dictionary file was the true crux of the matter: all the necessary information had been compiled and was contained inside. However, the biggest hurdles would come in trying to convert these millions of lines of text into a format that the Kindle could read and use.
My first thought was to use an approach outlined on 1ManFactory.com (note: this site no longer seems to be functioning), which involves converting a tab-delimited dictionary file to an OPF file (a kind of XML wrapper for the HTML files containing the core text). However, the first step would have to be converting the proprietary .BGL file into a tab-delimited text file. A GitHub user had already tackled this issue and shared his software, a Python program called BglConverter that makes the conversion. After struggling with dependency and build errors I was finally able to get the program to run, only to realize that the tab-delimited format did not support inflected dictionaries. Thus, I’d need to use a different approach than the one outlined in 1manfactory.com’s tutorial.
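For context, inflection support in a Kindle dictionary comes from markup rather than from a flat word-and-definition pair. Amazon's Kindle Publishing Guidelines describe `idx:entry`, `idx:orth`, `idx:infl` and `idx:iform` tags for this. The Python sketch below shows roughly what one such entry looks like. The `make_entry` function and its layout are my own assumptions, and a real source file may arrange the tags differently.

```python
from html import escape


def make_entry(headword, definition, inflections):
    """Build one dictionary entry with its inflected forms (illustrative layout)."""
    iforms = "".join(
        f'<idx:iform value="{escape(form, quote=True)}"/>' for form in inflections
    )
    # Only emit the inflection group when there is something to list.
    infl = f"<idx:infl>{iforms}</idx:infl>" if iforms else ""
    return (
        "<idx:entry>"
        f'<idx:orth value="{escape(headword, quote=True)}">'
        f"<b>{escape(headword)}</b>{infl}</idx:orth>"
        f"<br/>{escape(definition)}"
        "</idx:entry>"
    )


print(make_entry("tesir", "effect", ["tesirden"]))
```

A tab-delimited line has no place to put the `idx:iform` list, which is why that format cannot carry inflections.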
Another user had built a small piece of software in C# called Babylon To HTML, which performed just as the name implied. It was at this point that the enormity of the dictionary became evident: the resulting HTML file (structured nicely to include inflected forms with their proper tagging) weighed in at almost 75 MB and was cumbersome to work with. I realized that the Kindlegen tool utilized in 1manfactory.com’s approach also accepted a single HTML file as an input file and attempted to convert the file straight away. My poor laptop’s fan roared and the process consumed nearly 90% of available RAM before stalling.
A few things were wrong:
- Kindlegen parses input files to ensure conformity with Amazon-specified formatting and content. The process was attempting to force-close <p> paragraphs on every entry in the massive file, causing the entire process to hang before getting anywhere near even a fourth of the way through the file. Solution: Delete all instances of </p> with a text editor that has a find & replace all function. The tags are not needed and do not affect the formatting, so deleting them will only decrease processing time.
- An odd character had slipped into the dictionary file and was inserted into the file as ….. I again used find & replace all to delete these instances.
- The file is too damn large!
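The first two fixes do not strictly need a text editor. A minimal Python sketch of the same find & replace all is shown below, streaming the file line by line so that a 75 MB input never has to sit in memory at once. The function names, the file paths in the usage note and the UTF-8 encoding are assumptions on my part, and the odd character would have to be supplied as its own token.

```python
def strip_tokens(text, tokens):
    """Delete every occurrence of each token from a piece of text."""
    for token in tokens:
        text = text.replace(token, "")
    return text


def clean_file(src_path, dst_path, tokens, encoding="utf-8"):
    """Copy src to dst one line at a time, removing the unwanted tokens."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(strip_tokens(line, tokens))
```

Something like `clean_file("dictionary.html", "dictionary_clean.html", ["</p>"])` would then cover the first bullet, with both file names being placeholders.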
After taking care of the first two issues I was hopeful that the process would eventually power through and output a .mobi file. I let the fan spin for 40 minutes before coming to the conclusion that the file needed to be split up if there was any chance that Kindlegen could complete successfully.
I took a look into the structure of the OPF files and realized that the format was actually quite straightforward: a basic XML file with a manifest listing the individual HTML files included in the eBook and an index (spine) that assigns the appropriate order of these files. If I could find a way to split the behemoth HTML file into smaller components, I could package them in an OPF and Kindlegen might be able to process them after all. I researched automated methods of splitting the HTML, but couldn’t find anything suitable. I’d need each file to retain the header and footer information, and include some number of dictionary entries sandwiched in between.
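To make the manifest and spine idea concrete, here is a rough Python sketch that writes out a small OPF for a list of HTML parts. It is not the file I actually used: the `build_opf` name, the `part00`-style ids and the bare-bones metadata are assumptions, and a real dictionary OPF may need more metadata than this. The `DictionaryInLanguage` and `DictionaryOutLanguage` elements are the ones Amazon documents for dictionaries.

```python
from xml.sax.saxutils import escape, quoteattr


def build_opf(title, html_files, in_lang="tr", out_lang="en"):
    """Return OPF text whose manifest and spine list html_files in order."""
    manifest = "\n".join(
        f'    <item id="part{i:02d}" href={quoteattr(name)} '
        'media-type="application/xhtml+xml"/>'
        for i, name in enumerate(html_files)
    )
    spine = "\n".join(
        f'    <itemref idref="part{i:02d}"/>' for i in range(len(html_files))
    )
    return f"""<?xml version="1.0" encoding="utf-8"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:language>{in_lang}</dc:language>
    <dc:identifier id="uid">{escape(title)}</dc:identifier>
    <x-metadata>
      <DictionaryInLanguage>{in_lang}</DictionaryInLanguage>
      <DictionaryOutLanguage>{out_lang}</DictionaryOutLanguage>
    </x-metadata>
  </metadata>
  <manifest>
{manifest}
  </manifest>
  <spine>
{spine}
  </spine>
</package>
"""


print(build_opf("Turkish-English Dictionary", ["tr00.html", "tr01.html"]))
```

The manifest says which files exist and the spine says in what order to read them, so adding a part means adding one line to each.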
Ultimately I decided to bite the bullet and split the HTML file by hand. I figured that something around 15-20 “chapters” would reduce the size enough to allow Kindlegen to process successfully. There were something like 200,000+ lines of text in the original HTML file, so I simply took this number and divided by 15. Then I navigated to the corresponding line number, selected all of the above text and pasted it into a new file. I did this until all files were roughly the same size, went back to add the footer HTML tags to each document and then named them (following the original order) as tr00.html and so on. I then needed to update the OPF file with the corresponding file names and numbers.
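In hindsight, the manual splitting could probably be scripted. The Python sketch below shows the idea under one big assumption: that the header, the footer and the individual entries have already been pulled apart into strings, which depends on how the HTML is laid out. The `split_entries` name and its parameters are hypothetical.

```python
def split_entries(header, entries, footer, n_parts, prefix="tr"):
    """Return (filename, html) pairs, each holding header + some entries + footer.

    Produces at most n_parts files of roughly equal size, named tr00.html,
    tr01.html and so on, in the original entry order.
    """
    # Ceiling division, with a floor of 1 so an empty list does not break range().
    size = max(1, -(-len(entries) // n_parts))
    parts = []
    for start in range(0, len(entries), size):
        name = f"{prefix}{start // size:02d}.html"
        body = "".join(entries[start:start + size])
        parts.append((name, header + body + footer))
    return parts


demo = split_entries("<html><body>", ["a", "b", "c", "d", "e"], "</body></html>", 2)
for name, html in demo:
    print(name, html)
```

Each returned pair can be written straight to disk, and the list of file names is exactly what the OPF manifest and spine need.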
Then came my moment of truth: I ran Kindlegen with the new OPF input and after a few long minutes it spat out a 7.5 MB mobi eBook. I transferred this to Calibre, added a cover and spruced up the metadata and then crossed my fingers as I sent the file to my Kindle. Opening Sabahattin Ali’s Kürk Mantolu Madonna, I looked for some inflected words.
To my great happiness and relief, it worked (mostly)! Words like “tesirden” (composed of tesir (effect) and the suffix -dAn meaning from) and “canlanıyor” (conjugated form of the verb canlanmak) were correctly called up under their roots. There are some notable exceptions:
- The provided verb conjugations, while rather thorough, are not complete. This limits the ability to highlight any verb in a text and receive a definition. The -abil-, -ama-, and -miş forms are among those missing. It’s possible that the missing forms could be added to the dictionary with relatively pain-free automation, but that would be a task for another Sunday.
- The dictionary file does not contain all verbs. Some rather common ones that are missing: artmak, oluvermek, illiklemek…
- The Vocabulary Builder does not associate saved words with the dictionary, and thus the function is not useful. This may be related to quick-to-fix problems with the mobi file’s metadata, or may be an impossible-to-fix issue from Amazon’s side as Turkish is not officially supported. I will be looking into this.
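As a taste of what that automation might look like, here is a small Python sketch that derives the bare -miş form from an infinitive using four-way vowel harmony. It is a sketch under assumptions rather than a complete conjugator: the `mis_form` name is made up, personal endings are ignored, and only regular infinitives ending in -mak or -mek are handled.

```python
# Four-way vowel harmony: the suffix vowel follows the last vowel of the stem.
HARMONY = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def mis_form(infinitive):
    """Return stem + -mIş for an infinitive ending in -mak or -mek."""
    if not infinitive.endswith(("mak", "mek")):
        raise ValueError(f"not a -mak/-mek infinitive: {infinitive}")
    stem = infinitive[:-3]
    last_vowel = next((ch for ch in reversed(stem) if ch in HARMONY), None)
    if last_vowel is None:
        raise ValueError(f"no vowel found in stem: {stem}")
    return stem + "m" + HARMONY[last_vowel] + "ş"


for verb in ["canlanmak", "gelmek", "okumak", "görmek"]:
    print(verb, "->", mis_form(verb))
```

Forms generated this way could be appended to each verb's list of inflections before the HTML is rebuilt.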
I have to say that I am really pleased with the result. As a somewhat confident Turkish speaker, I can put the rest of the puzzle together on my own once I have the meaning of a root from inflected text, which lets me read fluidly without needing to move to a second device or paper dictionary for word look-up. The dictionary contains an impressive list of definitions with an especially strong showing of nouns, and now I’m confident that I’ll be able to use the device as I intended and read Turkish novels on the go.
If anyone stumbles across this page looking for a download-ready dictionary file, I’ve uploaded it here: Turkish-English Dictionary for Kindle. Happy reading!