Back in July, I was at the Mozilla summit in British Columbia, meeting with Mozillanoids from all over the world. The first question is always, “So what are you working on?” When I explained the idea of Ubiquity to people, the most common reaction was “That’s cool, but it’s so… English-centric. How are you going to localize it?”
These people are right to think that localizing a linguistic UI is harder than localizing a graphical UI. The difficulty is twofold: First, there are simply more words that need translation in a linguistic UI. Second, the implementation of the parser is based on grammatical rules which are themselves language-dependent. For example, English puts verbs before objects, so it seems most natural to enter a command first and then its arguments. But plenty of other languages put verbs at the end of a sentence.
Fired up by conversations with an international crew of Mozilla localizers, and by an inspiring keynote speech from Mitchell Baker on (among other things) the importance of internationalization, I set to work and spent the next 48 caffeine-fueled hours of the summit writing a proof-of-concept Japanese parser for Ubiquity. (Japanese happens to be the language I know best other than English.) I was able to show off the Japanese parser in action at my summit presentation.
“But how will this work in Finnish?” “How will it work in Thai?” asked the localizers and other interested parties who surrounded me after my presentation was done. Every human language has its own idiosyncrasies. Is Mozilla going to write a unique parser for every language in the world?
No. Here’s what’s going to happen: I’m going to write a plugin API for the parser, and then members of the global community are going to start writing parser plugins for their own languages. The parser localization must be parallelized, just as the localization of individual commands must be parallelized, and I have no doubt that our amazing community is up to the task. I’ve already seen so much interest in localizing Ubiquity on our support forum — particularly from Northern Europe — that I expect to see German and Danish translations of Ubiquity commands begin to miraculously appear on the Internet a few days after I put up the localization tutorial.
But before that can happen, I need to figure out how the parser plugin API should work.
I’ll start by highlighting the differences between English grammar and Japanese grammar. Each one of these differences corresponds to some chunk of the parsing algorithm that will need to be customized depending on the user’s choice of language. After that, I’ll generalize from talking about English and Japanese to talking about the many ways that human languages can differ from each other, and how these differences will have to be reflected in the parser plugin API.
Here’s an English sentence which also happens to work as a Ubiquity command:
translate "hello" from english to japanese
Here’s a similar Japanese sentence:
The first thing you’ll notice is that there are no spaces in this sentence. Written Japanese doesn’t use them. Furthermore, the current standard Japanese keyboard input methods overload the spacebar as a way to choose between multiple characters with the same reading. So asking the user to insert spaces between words when entering a command isn’t reasonable, and a parsing strategy that starts with splitting on spaces, as our English parser does, is doomed.
The lack of spaces isn’t the only problem. Let’s break down that Japanese sentence with a word-by-word translation:
Japanese is what linguists call an SOV language, meaning “Subject – Object – Verb”, as opposed to English, which is SVO or “Subject – Verb – Object”. The verb in Japanese normally comes at the end of a sentence, which means that if you’re entering Japanese in a natural-language way, you’ll expect to type in arguments first and the command last.
Take another look at that sentence. Each noun is followed by a word called a “particle”. These play the same role as the prepositions in an English sentence, but they come after the nouns they modify. 「から」(“kara”) means “from”, but it comes after the name of what you’re translating from. There’s also a particle 「を」(“wo”), which marks the direct object of the sentence — something that has no direct equivalent in English.
What does the parser need to do, therefore, in order to parse Japanese? It needs to have a way of splitting up words without depending on spaces. It needs to treat the last word in the sentence, instead of the first, as the verb name. And it needs to use the particles to decide which part of the input goes to each argument of the command.
In my proof-of-concept parser, instead of splitting on spaces, I split on particles — I searched the input string for every Japanese particle I know about, and split on each one. This doesn’t exactly get us down to individual words, but that’s OK! It separates the arguments from each other and from the verb, and that’s really all we care about.
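To make the idea concrete, here is a minimal sketch of splitting on particles. It is written in Python purely for illustration and is not the proof-of-concept parser's actual code. The three-entry particle list, the function name `split_on_particles`, and the sample input are all assumptions made for this example; a real plugin would need a much fuller particle list and some way to cope with particles that also occur inside ordinary words.

```python
import re

# Hypothetical, deliberately tiny particle list (illustration only).
PARTICLES = ["を", "から", "に"]


def split_on_particles(text):
    """Split text at every known particle, keeping the particles as chunks."""
    # Try longer particles first so a multi-character particle is not
    # shadowed by a shorter one that happens to be its prefix.
    ordered = sorted(PARTICLES, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(p) for p in ordered) + ")"
    # A capturing group makes re.split return the separators too.
    return [chunk for chunk in re.split(pattern, text) if chunk]


# Illustrative input, roughly: "hello" (object) English-from Japanese-to translate
chunks = split_on_particles('"hello"を英語から日本語に翻訳して')
print(chunks)
# ['"hello"', 'を', '英語', 'から', '日本語', 'に', '翻訳して']
```

Note that the last chunk is not a single word in any linguistic sense; it is simply whatever is left after the final particle, which is enough to treat it as the verb.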
Once that was done, teaching the parser to expect the verb at the end and to expect the “prepositions” (or in this case, particles) to come after the nouns was all very easy. (Customizing the word order shouldn’t really even require any new code — it should be parameterizable, i.e. we should be able to simply pass in some constant to tell the parser what word order to expect.)
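The parameterization described above might look something like the following sketch, again in Python for illustration only. The function name `assign_arguments`, the role names (`object`, `source`, `goal`), and the marker tables are hypothetical and are not taken from Ubiquity. The one trick is that when markers follow their nouns, reversing the argument chunks makes every marker precede its noun, so the same loop serves both orders.

```python
def assign_arguments(tokens, markers, verb_last, postpositions):
    """Return (verb, {role: value}) for an already-split command.

    tokens        -- list of chunks, e.g. from a language-specific splitter
    markers       -- dict mapping a preposition/particle to a role name
    verb_last     -- True if the verb is the final chunk, False if the first
    postpositions -- True if markers follow their nouns, False if they precede
    """
    tokens = list(tokens)
    verb = tokens.pop() if verb_last else tokens.pop(0)
    if postpositions:
        tokens.reverse()  # now every marker comes before its noun
    args = {}
    pending_role = None
    for token in tokens:
        if pending_role is None and token in markers:
            pending_role = markers[token]
        else:
            # A noun with no marker is assumed to be the direct object.
            args[pending_role or "object"] = token
            pending_role = None
    return verb, args


english = assign_arguments(
    ["translate", '"hello"', "from", "english", "to", "japanese"],
    {"from": "source", "to": "goal"},
    verb_last=False, postpositions=False)

japanese = assign_arguments(
    ['"hello"', "を", "英語", "から", "日本語", "に", "翻訳して"],
    {"を": "object", "から": "source", "に": "goal"},
    verb_last=True, postpositions=True)

print(english)
print(japanese)
```

Both calls run the same code path; only the two booleans and the marker table change, which is the sense in which word order is a constant rather than new logic.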
With the verb identified and each other word assigned to one of its arguments, parsing is mostly done. There’s still the matter of having the noun-types produce suggestions for each argument value, ordering the suggestion list, and so on, but the logic for that stuff is not language-dependent. (Various strings used by the specific noun-types and the specific commands still have to be localized, but that’s the easy part. We already know how to localize string resources!)
Generalizing to other languages
So we’ve seen that all the differences between English grammar and Japanese grammar — that is, all the ones that Ubiquity cares about — can be reduced to:
- A boolean telling whether verbs come first or last
- A boolean telling whether to use prepositions or “postpositions”
- A function which splits the input into words
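Taken literally, those three items could be bundled into one small per-language record. The sketch below shows one possible shape, in Python for illustration; it is not a proposed or existing Ubiquity API. `ParserPlugin`, `PLUGINS`, the field names, and both splitter functions are hypothetical, and the Japanese particle list is far from complete.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ParserPlugin:
    """Hypothetical per-language settings; every name here is illustrative."""
    verb_last: bool       # True if the verb comes at the end of the input
    postpositions: bool   # True if role markers follow their nouns
    split_words: Callable[[str], List[str]]  # breaks raw input into chunks


def split_on_spaces(text: str) -> List[str]:
    return text.split()


def split_on_japanese_particles(text: str) -> List[str]:
    # Tiny particle list, for illustration only.
    return [chunk for chunk in re.split("(から|を|に)", text) if chunk]


PLUGINS = {
    "en": ParserPlugin(verb_last=False, postpositions=False,
                       split_words=split_on_spaces),
    "ja": ParserPlugin(verb_last=True, postpositions=True,
                       split_words=split_on_japanese_particles),
}

plugin = PLUGINS["ja"]
print(plugin.verb_last, plugin.split_words("英語から日本語に翻訳して"))
```

A record this small is easy for a localizer to fill in, but it has no slot for anything beyond word order and splitting.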
So is that all our parser plugin API needs to support? Well, somehow I doubt it’s that easy. Other languages have other idiosyncrasies not reflected here. If I’m remembering my high school Latin right, there are languages that decline nouns instead of (or in addition to) using prepositions — that is, the information about how each noun relates to the verb is encoded in noun suffixes, some of them irregular. And in (Mandarin) Chinese, which I’m starting to study now, the word order is very important, and there are all sorts of “counter words” and “auxiliary verbs” and other interesting constructs that I am just beginning to dimly understand.
In order to design this API and make it general enough to handle the full range of languages, without asking every localizer to start from scratch with the parser logic, I’m going to need to know a lot more than I currently do about the range of language behaviors that are out there. I’m already working on researching this, but I’d love to hear from you readers:
What does the sentence “Translate foo from bar to baz” look like in your favorite language? How does a listener know which word is the thing to be translated, which is the language to translate to, and which is the language to translate from? What sort of customizations would you need to make to the parsing algorithm in order to make it work with your language?