Over the last few days I've been improving a tool I wrote a long time ago to make it easier to have fun with Ithkuil. But let's start at the beginning.
Ithkuil
Ithkuil is a constructed language created by John Quijada. Constructed languages (or "conlangs") are usually associated with children (I was inventing my own languages when I was 10-12), but in this case nothing could be further from the truth. Even though Ithkuil doesn't really have practical applications, I think it is unusually interesting.
Ithkuil emphasizes conveying as much information as possible, as concisely as possible. As a result, it has 45 consonants and 13 vowels, and almost every sound in a word carries a separate bit of information. How was this achieved?
In Ithkuil there are two main classes of words: formatives and adjuncts. Formatives function as nouns or verbs, while adjuncts convey additional information about formatives and sometimes act like personal pronouns. Let's focus on formatives. Each one consists of a root, which carries the main information about the meaning of the word (for example, "oral sound") and which can then be inflected for over 20 different grammatical categories using numerous affixes. For example, the root for "oral sound" (-l-) can be inflected by adding "e-" in front -> "el-", making it "spoken utterance". To get the smallest possible word, we need another vowel and a consonant -> "elal". The "a" marks the Oblique case, which is pretty neutral. The "-l", on the other hand, means that we are speaking of a single object functioning as a separate whole, taken in its entirety and as a concrete object rather than its mental representation. This way, "elal" can be translated simply as "spoken utterance".
We can modify the ending a bit. Where we have "l" now, we can insert any of the over 1700 consonantal combinations, each of which defines the values of 5 grammatical categories. For example, the "-rtkʰ" ending would mean a mental representation of a single object, which consists of multiple non-identical parts described by the root that serve a common purpose together. We can also add another "a" for euphony - it won't change the meaning, since it is the default value in this case. So, we get "elartkʰa", which means a representation of multiple non-identical oral utterances, serving a common purpose - that is, a language ;) This is actually what "ithkuil" meant in an earlier version of the language and where it got its name from.
One of the best examples of Ithkuil's capabilities is shown on its website. I won't reproduce it here; I'll just post the link: http://ithkuil.net/texts.html#duchamp
Since Ithkuil words can consist of more than ten morphemes, each with anywhere up to over a thousand possible forms, translating texts to or from Ithkuil takes a lot of time, most of it spent looking things up in the tables on the website. It would be nice to speed this process up a bit.
Computer analysis
Computers to the rescue! Ithkuil, although complex, is very regularly structured. Words can only consist of specific combinations of sounds, connected in a strictly defined way. This makes it possible to write a program that decomposes Ithkuil words into their basic parts - morphemes. Multiple projects aiming to do exactly this have appeared, one of which - created by me - is the topic of this post ;)
The basic analysis is very simple. Since each morpheme in Ithkuil consists either only of vowels or only of consonants, the program can immediately divide the word into morphemes. What's left is to associate the morphemes with their positions - the so-called "slots". This turns out to be harder.
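To illustrate that first step, here is a minimal Python sketch of splitting a word into vowel and consonant clusters. It is not the code from the actual tool: the `VOWELS` set is a simplified assumption that ignores stress-marked letters, and `split_clusters` is a name made up for this example.

```python
from itertools import groupby

# Assumption: a simplified vowel inventory, for illustration only.
# The real romanization also uses stress-marked vowel letters.
VOWELS = set("aeiouöüëâêîôû")


def split_clusters(word):
    """Split a word into maximal runs of vowels and of non-vowels."""
    runs = groupby(word.lower(), key=lambda ch: ch in VOWELS)
    return ["".join(run) for _, run in runs]


print(split_clusters("elal"))      # ['e', 'l', 'a', 'l']
print(split_clusters("elartkʰa"))  # ['e', 'l', 'a', 'rtkʰ', 'a']
```

Everything that is not in `VOWELS` (including "ʰ" and the glottal stop mark) is treated as part of a consonant cluster, which is enough for the two example words from this post.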
The problem is that, for the assignment of morphemes to slots to be unambiguous, the author had to introduce multiple complex rules. Some slots require other slots to be filled, some can only contain a limited subset of all possible combinations, and some must be separated from others with glottal stops (denoted "’") under specific circumstances. The analysis is possible, but complicated.
The initial version of the program was focused on formatives - they are the most complex class, so being able to properly analyze them would be about 70% of total progress. The algorithm was contained in one long, ugly function, which recognized step by step if a given consonantal/vocalic block could be in a given slot or not. The program worked, but it was ugly and hard to extend, so I pretty much abandoned it after implementing the analysis of formatives.
A few days ago I got the idea to approach the problem differently and write a parser based on a formal grammar. Initially I wanted to create a context-free grammar, but later I changed my mind and chose a PEG ("Parsing Expression Grammar"). The reason was that PEGs are easier to transform into working parsers, but this choice also made one other thing much easier - I'll cover that in a moment.
When I was starting to write the grammar, I planned to use it to parse everything except the stress, which is also meaningful in Ithkuil. The stress is denoted in a very complex way, though, because many letters have diacritics and adding another tick above them would be too ugly. Actually, the stress would probably require a separate grammar, so it appeared that it would be really time-consuming to include it.
Fortunately, PEGs have something called the &-predicate (and-predicate). A rule used in such a predicate tells the parser to check whether the input stream satisfies the rule, but not to consume any characters. This way we can match the same input against multiple rules at once. That gave me the idea to use &-predicates to match the stress and then decompose the words using the normal rules. It turned out to be very practical, so the current version of the grammar actually works this way.
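To make the idea concrete, here is a toy PEG written as plain Python functions. It is only a sketch under assumptions: the combinators (`lit`, `seq`, `choice`, `many`, `and_pred`) are hand-rolled for this example and are not the parser library used in the project, and the tiny "language" (letters a, l, k, with stress written as a capital A) is invented to keep things short.

```python
# A parser is a function (text, pos) -> new position, or None on failure.

def lit(s):
    """Match the literal string s."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def seq(*parsers):
    """Match all parsers one after another."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


def many(p, minimum=0):
    """Greedy repetition of p, at least `minimum` times."""
    def parse(text, pos):
        count = 0
        while True:
            nxt = p(text, pos)
            if nxt is None or nxt == pos:
                break
            pos, count = nxt, count + 1
        return pos if count >= minimum else None
    return parse


def and_pred(p):
    """&p: succeed only if p matches here, but consume nothing."""
    return lambda text, pos: pos if p(text, pos) is not None else None


# Invented toy grammar (assumption): a capital "A" is a stressed vowel.
consonant = choice(lit("l"), lit("k"))
vowel = choice(lit("a"), lit("A"))
has_stress = seq(many(choice(lit("a"), consonant)), lit("A"))
syllables = many(seq(consonant, vowel), minimum=1)

# First check the stress rule without consuming input, then parse structure.
word = seq(and_pred(has_stress), syllables)


def full_match(parser, text):
    return parser(text, 0) == len(text)


print(full_match(word, "lakA"))  # True
print(full_match(word, "laka"))  # False
```

The point is the last rule: `and_pred(has_stress)` scans the whole word for a stress mark and then hands the untouched input to `syllables`, so two independent rules are applied to the same characters.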
Associating the morphemes with their meanings was another problem. The roots are defined very arbitrarily, so I gave up on them at once, but the meanings of the affixes are strictly defined, so in theory nothing was in the way of the program describing the meanings of the parts of a word. In practice, though, the structure of the tables on the website made it a bit challenging. Each of them has its own structure, irregularities, differences in notation etc., which makes reading thousands of morphemes pretty hard. Fortunately, it was possible to download and parse each one separately. This allowed me to handle each of them on its own, which I did, although it was pretty boring. The result of this work is the database available in the repository.
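As a rough picture of what such a lookup involves, here is a small Python sketch. Everything in it is an assumption made for illustration: the position labels are hypothetical (they are not the official slot names), the table contains only the examples from this post, and the function assumes the hard part (assigning morphemes to slots) has already been done.

```python
# Hypothetical position labels and a tiny table built only from the
# examples in this post; the real database is far larger.
POSITIONS = ["prefix vowel", "root", "case vowel", "ending"]

MEANINGS = {
    ("prefix vowel", "e"): 'with root -l-: "spoken utterance"',
    ("root", "l"): '"oral sound"',
    ("case vowel", "a"): "Oblique case",
    ("ending", "l"): "single object, taken as a concrete whole",
    ("ending", "rtkʰ"): "mental representation of non-identical parts "
                        "serving a common purpose",
}


def describe(morphemes):
    """Pair each morpheme with a position label and look up a description."""
    result = []
    for position, form in zip(POSITIONS, morphemes):
        meaning = MEANINGS.get((position, form), "(not in this toy table)")
        result.append((form, position, meaning))
    return result


for form, position, meaning in describe(["e", "l", "a", "l"]):
    print(f"{form:<6}{position:<14}{meaning}")
```

Keying the table on the pair (position, form) matters because the same sound means different things in different positions: "l" as a root and "l" as an ending are unrelated entries.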
Results
Ultimately I wrote a few programs:
- A Python module (the parser described above), containing the basic functions for the analysis.
- A web application that lets you type a sentence in Ithkuil and get its analysis. The names of the categories are clickable; clicking one displays a description taken from the Ithkuil website (http://ithkuil.net).
- A Reddit bot that analyzes marked paragraphs and posts the results in an abbreviated form - so-called "glosses". At the moment it only works on the testing subreddit, but I'm planning to switch it to /r/ithkuil soon.
Plans
I'm also planning a few things, but I'm starting to lose interest in this project for now, so they'll probably have to wait a bit.
First, the tools I've created so far only support the analysis of words; they don't allow for word construction. It would be a nice feature to have. Fortunately, the morpheme database is already complete, so it's only a matter of writing some code, but it might be harder than it looks.
Second - Ithkuil has its own writing system. Unlike the Latin alphabet, it doesn't capture the pronunciation; it captures the grammatical categories. Given code that can already analyze the words, though, it should be fairly easy to turn the results of such an analysis into Ithkuil writing.
Third - there are already programs that can read text from pictures, and the Ithkuil characters aren't too complex. So it would be nice to have a system capable of reading the characters and outputting romanized text. This is a completely new topic for me though, so I'll probably come back to it in the distant future ;)