How Was Nasreddin Created?

My journey developing the most challenging Azerbaijani Turkish transliteration and translation project, and my next steps

Araz Gholami
Apr 25, 2024

Nasreddin.org Main Page

Azerbaijani Turkish is a language that has experienced many ups and downs. Like other Turkic languages, it was initially written with the old Arabic alphabet. Later, in the Republic of Azerbaijan, it was written first with the Latin alphabet, then with Cyrillic, and then with Latin again. In Iran’s Azerbaijan, various orthographic proposals were made, and eventually, at the second orthography seminar, the standardization of Azerbaijani Turkish in the Arabic alphabet was completed. (Even now, though, many speakers write the language with non-standard alphabets, either because they are unfamiliar with the approved alphabet or because they lack appropriate tools.)

Years ago, in 2014, one of my friends suggested that I create software that could convert Azerbaijani Turkish text from the Latin alphabet to the Arabic alphabet. At first, it seemed very simple: just replace “a” with “آ” and you are done. But it wasn’t that simple. The presence of numerous words with Arabic roots (which had to be written with specific Arabic letters), the way suffixes attach, the placement of the “ə” sound at the beginning, middle, and end of a word, and dozens of other challenges led me to abandon the initial project, which I had created at that time under the name “Küçürən”.
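To see why the one-to-one idea falls short, here is a minimal Python sketch of that naive replacement. The tiny `NAIVE_MAP` table and the rule of leaving “ə” unwritten are simplifications assumed for illustration; they are not Nasreddin's real rules.

```python
# Toy letter table, assumed for illustration only; it is not the full alphabet.
NAIVE_MAP = {
    "a": "ا", "p": "پ", "r": "ر", "d": "د",
    "s": "س", "ı": "ی",
    "ə": "",  # assumption: this short vowel is simply left unwritten
}


def naive_to_arabic(word: str) -> str:
    """Convert letter by letter, with no knowledge of roots or suffixes."""
    out = []
    for i, ch in enumerate(word):
        if i == 0 and ch == "a":
            out.append("آ")  # word-initial "a"
        else:
            out.append(NAIVE_MAP.get(ch, ch))  # unknown letters pass through
    return "".join(out)


print(naive_to_arabic("apardı"))  # a native word comes out fine
print(naive_to_arabic("səda"))    # an Arabic-root word gets س where ص is required
```

The second call is the problem in miniature: a letter table has no way of knowing that “səda” must start with “ص”.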

Ten years later, on International Mother Language Day and with some free time on my hands, I returned to this project. Building on the practice I had gained in front-end programming and JavaScript frameworks, I created a new user interface for it. The main challenge remained: rebuilding the alphabet conversion system.

The first step was to collect all the words with foreign roots (English and Latin, Russian, Arabic, and Persian) used in Azerbaijani Turkish and manually produce their equivalents in the Arabic alphabet. A large part of these words was collected from Wiktionary and Wikipedia using automation scripts. Then came trial and error: I converted dozens of texts with an Arabic context (religious texts and articles), extracted the problematic words, and added their correct equivalents to the vocabulary. In the end, more than 2,500 key foreign words were stored in the Nasreddin database.

The next step was to identify suffixes. Without suffix recognition, a word like “səda” would be correctly converted to “صدا” (not “سدا”), but an inflected form like “sədasi” would not be recognized and would be written as “سداسی” (not “صداسی”). Here the AzConvert database came into play, supplementing Nasreddin’s suffix repository with over 500 suffixes; I added another 200 myself.
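The root-plus-suffix idea can be sketched roughly as follows. This is a simplified illustration, not Nasreddin's actual code: the `ROOTS` and `SUFFIXES` tables hold only toy entries, the function name is hypothetical, and the greedy longest-suffix-first strategy is an assumption.

```python
from typing import Optional

# Hand-curated spellings for foreign-root words (a single toy entry here).
ROOTS = {"səda": "صدا"}
# Latin suffix -> Arabic spelling; both entries are illustrative assumptions.
SUFFIXES = {"si": "سی", "lar": "لار"}


def convert_with_suffixes(word: str) -> Optional[str]:
    """Peel known suffixes off the end until a known root remains.

    Returns None when no known root is reached, so a caller can fall
    back to plain letter-by-letter conversion.
    """
    stem, tail = word, []
    while stem not in ROOTS:
        for suffix in sorted(SUFFIXES, key=len, reverse=True):  # longest first
            if stem.endswith(suffix) and len(stem) > len(suffix):
                tail.insert(0, SUFFIXES[suffix])  # keep suffixes in word order
                stem = stem[: -len(suffix)]
                break
        else:
            return None  # no suffix matched and the stem is not a known root
    return ROOTS[stem] + "".join(tail)


print(convert_with_suffixes("sədasi"))  # root spelling is preserved before the suffix
```

A greedy strip like this follows only one path through the suffixes; a real system may need to backtrack when two suffixes overlap.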

By identifying the roots of Arabic (and other foreign) words, and by identifying suffixes and separating them from the word one after another, Nasreddin achieved 100% accuracy in converting Latin-alphabet words to Arabic. This opened the door to an endless supply of content. It was enough to give Nasreddin a book in Azerbaijani Turkish in the Latin alphabet and receive the Arabic-alphabet version in less than a minute. Alternatively, Google Translate or any other translator could turn texts in other languages into Azerbaijani Turkish in the Latin alphabet, and the same system could then convert the result to the Arabic alphabet.

Now only one thing was left: converting Arabic back to Latin. This could not be done automatically, because of the lack of proper spelling and, of course, because the sound “ə” is not written inside words. For example, there was no way to derive “getmək” from “گئتمک”: not by identifying suffixes, not by identifying word roots, not by any other algorithm. (Believe me! I tried every method, including identifying phonetic groups.)

This is where an idea came to mind: collect a database of Latin words, convert them with the Latin-to-Arabic system in various writing styles, and store the results. For example, the Latin word “apardı” was produced in two styles, “آپاردؽ” and “آپاردی”, and both were stored in the vocabulary repository.

The only thing missing was a source of Azerbaijani Turkish words in the Latin alphabet, and at first glance, what could be better than Wikipedia? I got to work: I downloaded the dump file of the Latin-alphabet Azerbaijani Turkish Wikipedia, separated the data into meaningful paragraphs, split each paragraph into words, and converted each word into its different writing styles. An initial estimate put the processing time for the entire data set at roughly four years, which was practically impossible. A new attempt with various scripts reduced this figure to 10 minutes. In total, more than one million words and more than 1.8 million different writing styles were extracted. Problems such as spelling mistakes in the original Wikipedia articles and the use of non-standard characters like e and a were addressed through several rounds of trial and error, and the Arabic-to-Latin vocabulary repository reached an acceptable state.
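A minimal sketch of this reverse-lookup idea in Python follows. Everything in it is assumed for illustration: the `CHAR_STYLES` table covers only the letters of one word, and the function names are hypothetical rather than taken from Nasreddin.

```python
import re
from collections import defaultdict
from itertools import product

# Toy table: each Latin letter maps to one or more Arabic spellings ("styles").
# Assumed for illustration; only "ı" has two styles here.
CHAR_STYLES = {
    "a": ["ا"], "p": ["پ"], "r": ["ر"], "d": ["د"],
    "ı": ["ؽ", "ی"],
}


def az_lower(text: str) -> str:
    """Lowercase with Azerbaijani casing rules: I -> ı and İ -> i."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def variants(word: str) -> set:
    """Every Arabic spelling of a Latin word under CHAR_STYLES."""
    choices = []
    for i, ch in enumerate(word):
        if i == 0 and ch == "a":
            choices.append(["آ"])  # word-initial "a"
        else:
            choices.append(CHAR_STYLES.get(ch, [ch]))  # unknown letters pass through
    return {"".join(combo) for combo in product(*choices)}


def build_reverse_index(corpus: str) -> dict:
    """Map each generated Arabic spelling back to its Latin source word(s)."""
    index = defaultdict(set)
    # Deduplicate tokens first so each distinct word is expanded only once.
    for word in set(re.findall(r"[^\W\d_]+", az_lower(corpus))):
        for spelling in variants(word):
            index[spelling].add(word)
    return index


index = build_reverse_index("Apardı. APARDI apardı!")
print(index["آپاردی"])  # Latin candidates for this Arabic spelling
```

The custom `az_lower` matters because Python's default `str.lower()` turns “I” into “i”, not “ı”. Expanding each distinct word once, instead of every occurrence, is one cheap way to keep such a job short; whether the original scripts did this is not stated.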

Nasreddin’s Azerbaijani Turkish (in Arabic) to Latin Transliteration Example

Converting Azerbaijani Turkish texts from the Arabic alphabet to the Latin alphabet was now possible, which also made translation into other languages possible. Alongside that, spelling errors could be detected: it was enough to convert the words to Latin first and then back to Arabic to make sure all characters were written correctly.
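The round-trip check can be sketched like this. The two lookup tables are toy stand-ins for the real repositories, the function name is hypothetical, and treating the “ؽ” spelling as the preferred form is an assumption made only for the example.

```python
from typing import Optional, Tuple

# Toy Arabic -> Latin repository: several writing styles map to one Latin word.
ARABIC_TO_LATIN = {"آپاردؽ": "apardı", "آپاردی": "apardı"}
# Toy Latin -> Arabic converter output.
# Assumption: the ؽ spelling is treated as the preferred form here.
LATIN_TO_ARABIC = {"apardı": "آپاردؽ"}


def check_spelling(word: str) -> Tuple[str, Optional[str]]:
    """Round-trip an Arabic-script word through Latin and compare.

    Returns ("ok", None), ("suggest", preferred_spelling), or
    ("unknown", None) when the word is not in the repository.
    """
    latin = ARABIC_TO_LATIN.get(word)
    if latin is None:
        return ("unknown", None)
    round_trip = LATIN_TO_ARABIC[latin]
    if round_trip == word:
        return ("ok", None)
    return ("suggest", round_trip)


print(check_spelling("آپاردی"))  # differs from the round-trip form, so a suggestion is returned
```

Words that come back unchanged are accepted, words that come back different get a suggested spelling, and words missing from the repository are flagged for manual review.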

The next step for me was to create an OCR for Azerbaijani Turkish in the Arabic alphabet. At first glance, it seemed possible to use the Tesseract library with its Persian language data for Azerbaijani Turkish, but its numerous errors in recognizing the special characters of Azerbaijani Turkish dampened the initial excitement. Although this system now exists in Nasreddin, its performance is not very satisfactory.

And so Nasreddin.org was born: a set of linguistic and transliteration tools for the Azerbaijani Turkish alphabet.

Nasreddin

My work on this project (aside from necessary improvements) is considered complete, but the Azerbaijani Turkish language needs several other things to survive:

  • Adding the character “وْ” to Unicode to facilitate and simplify typing Azerbaijani Turkish (the proposal is written and will be submitted soon.)
  • Official and standard keyboard for iOS and macOS (a request to add Azerbaijani Turkish characters to the Persian keyboard or create a separate Azerbaijani Turkish keyboard for Apple has been sent.)
  • Official and standard keyboard for Windows (under research)
  • Official and standard keyboard for Linux (this exists, it just needs the character “وْ” to be added - under research)
  • Improvement and correction of official titles on various platforms (for example, a request to correct the language name in WordPress has been registered and approved and will be released soon.)
  • Localization registration and translation in various platforms and software like Ubuntu.
  • Registering Azerbaijani Turkish with the Arabic alphabet in Google Translate
  • Registering the standard layout of Azerbaijani Turkish with the Arabic alphabet in the National Standards Organization of Iran

As you can see, each of these tasks is extremely time-consuming, exhausting, and beyond the capabilities of one person. Since there is no financial or moral benefit for any of these tasks, the only motivation for these actions is the interest and heartfelt belief in popularizing the use of our mother tongue. If you also have this interest and heartfelt belief, contact me so that we can work together in this direction.

Araz Gholami | contact@arazgholami.com
April 25, 2024
