Back in July, I was at the Mozilla summit in British Columbia, meeting with Mozillanoids from all over the world. The first question is always, “So what are you working on?”. When I explained the idea of Ubiquity to people, the most common reaction was “That’s cool, but it’s so… English-centric. How are you going to localize it?”
These people are right to think that localizing a linguistic UI is harder than localizing a graphical UI. The difficulty is twofold: First, there are simply more words that need translation in a linguistic UI. Second, the implementation of the parser is based on grammatical rules which are themselves language-dependent. For example, English puts verbs before objects, so it seems most natural to enter a command first and then its arguments. But plenty of other languages put verbs at the end of a sentence.
Fired up by conversations with an international crew of Mozilla localizers, and by an inspiring keynote speech from Mitchell Baker on (among other things) the importance of internationalization, I set to work and spent the next 48 caffeine-fueled hours of the summit writing a proof-of-concept Japanese parser for Ubiquity. (Japanese happens to be the language I know best other than English). I was able to show off the Japanese parser in action at my summit presentation.
“But how will this work in Finnish?” “How will it work in Thai?” asked the localizers and other interested parties who surrounded me after my presentation was done. Every human language has its own idiosyncrasies. Is Mozilla going to write a unique parser for every language in the world?
No. Here’s what’s going to happen: I’m going to write a plugin API for the parser, and then members of the global community are going to start writing parser plugins for their own languages. The parser localization must be parallelized, just as the localization of individual commands must be parallelized, and I have no doubt that our amazing community is up to the task. I’ve already seen so much interest in localizing Ubiquity on our support forum — particularly from Northern Europe — that I expect to see German and Danish translations of Ubiquity commands begin to miraculously appear on the Internet a few days after I put up the localization tutorial.
But before that can happen, I need to figure out how the parser plugin API should work.
I’ll start by highlighting the differences between English grammar and Japanese grammar. Each one of these differences corresponds to some chunk of the parsing algorithm that will need to be customized depending on the user’s choice of language. After that, I’ll generalize from talking about English and Japanese to talking about the many ways that human languages can differ from each other, and how these differences will have to be reflected in the parser plugin API.
Parsing Japanese
Here’s an English sentence which also happens to work as a Ubiquity command:
translate "hello" from english to japanese
Here’s a similar Japanese sentence:
「今日は」を日本語から英語に翻訳する
The first thing you’ll notice is that there are no spaces in this sentence. Written Japanese doesn’t use them. Furthermore, the current standard Japanese keyboard input methods overload the spacebar as a way to choose between multiple characters with the same reading. So asking the user to insert spaces between words when entering a command isn’t reasonable. And a parsing strategy that starts with splitting on spaces — as our English parser does — is doomed.
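To see the problem concretely, here’s what naive space-splitting does to the two example sentences above (just an illustration, not actual parser code):

// The English parser's first step works fine on the English sentence:
'translate "hello" from english to japanese'.split(" ");
// -> ["translate", '"hello"', "from", "english", "to", "japanese"]

// The same step on the Japanese sentence returns one giant unsplit token,
// because written Japanese has no spaces to split on:
"「今日は」を日本語から英語に翻訳する".split(" ");
// -> ["「今日は」を日本語から英語に翻訳する"]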
The lack of spaces isn’t the only problem. Let’s break down that Japanese sentence with a word-by-word translation:
「今日は」 = “hello”
を = (direct-object particle)
日本語 = “Japanese”
から = “from”
英語 = “English”
に = “to”
翻訳する = “translate”
Japanese is what linguists call a SOV language, meaning “Subject – Object – Verb”, as opposed to English which is SVO or “Subject – Verb – Object”. The verb in Japanese normally comes at the end of a sentence, which means that if you’re entering Japanese in a natural-language way, you’ll expect to type in arguments first and the command last.
Take another look at that sentence. Each noun is followed by a word called a “particle”. These play the same role as the prepositions in an English sentence, but they come after the nouns they modify. 「から」(“kara”) means “from”, but it comes after the name of what you’re translating from. There’s also a particle 「を」(“wo”), which marks the direct object of the sentence — something that has no direct equivalent in English.
What does the parser need to do, therefore, in order to parse Japanese? It needs to have a way of splitting up words without depending on spaces. It needs to treat the last word in the sentence, instead of the first, as the verb name. And it needs to use the particles to decide which part of the input goes to each argument of the command.
In my proof-of-concept parser, instead of splitting on spaces, I split on particles — I searched the input string for every Japanese particle I know about, and split on each one. This doesn’t exactly get us down to individual words, but that’s OK! It separates the arguments from each other and from the verb, and that’s really all we care about.
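Here’s a minimal sketch of that idea in JavaScript (the particle list and the function name are made up for illustration; this isn’t the actual proof-of-concept code). It splits wherever a known particle appears and remembers which particle followed each chunk:

// Hypothetical particle-based splitter, not the real Ubiquity parser.
// Each particle plays roughly the role of an English preposition.
var PARTICLES = ["を", "から", "まで", "に", "へ", "で", "と"];

function splitOnParticles(input) {
  // A capturing group makes split() keep the particles in the result.
  var re = new RegExp("(" + PARTICLES.join("|") + ")");
  var pieces = input.split(re).filter(function (s) { return s.length > 0; });

  // Pair each chunk of text with the particle that follows it, if any.
  var chunks = [];
  for (var i = 0; i < pieces.length; i++) {
    if (PARTICLES.indexOf(pieces[i + 1]) !== -1) {
      chunks.push({ text: pieces[i], particle: pieces[i + 1] });
      i++; // skip the particle we just consumed
    } else {
      chunks.push({ text: pieces[i], particle: null });
    }
  }
  return chunks;
}

splitOnParticles("「今日は」を日本語から英語に翻訳する");
// -> [ { text: "「今日は」", particle: "を" },
//      { text: "日本語", particle: "から" },
//      { text: "英語", particle: "に" },
//      { text: "翻訳する", particle: null } ]  // the particle-less chunk at the end is the verb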
Once that was done, teaching the parser to expect the verb at the end and to expect the “prepositions” (or in this case, particles) to come after the nouns was all very easy. (Customizing the word order shouldn’t really even require any new code — it should be parameterizable, i.e. we should be able to simply pass in some constant to tell the parser what word order to expect.)
With the verb identified and every other word assigned to one of its arguments, parsing is mostly done. There’s still the matter of having the noun-types produce suggestions for each argument value, ordering the suggestion list, and so on, but the logic for that stuff is not language-dependent. (Various strings used by the specific noun-types and the specific commands still have to be localized, but that’s the easy part. We already know how to localize string resources!)
Generalizing to other languages
So we’ve seen that all the differences between English grammar and Japanese grammar — that is, all the ones that Ubiquity cares about — can be reduced to the following (sketched in code just after this list):
- A boolean telling whether verbs come first or last
- A boolean telling whether to use prepositions or “postpositions”
- A function which splits the input into words
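If that really were the whole story, a parser plugin could be little more than a settings object. Here is a purely illustrative sketch; every name in it is invented, since the real API doesn’t exist yet:

// Hypothetical per-language parser settings; this is not the real Ubiquity API.
var EnglishParserSettings = {
  verbFirst: true,           // "translate hello from english to japanese"
  usesPrepositions: true,    // role markers come before the nouns they mark
  splitWords: function (input) {
    return input.split(/\s+/);
  }
};

var JapaneseParserSettings = {
  verbFirst: false,          // the verb comes at the end of the sentence
  usesPrepositions: false,   // particles ("postpositions") follow their nouns
  splitWords: function (input) {
    // Split on the PARTICLES list from the earlier sketch; the particles stay
    // in the token stream, just as prepositions do in the English split.
    return input.split(new RegExp("(" + PARTICLES.join("|") + ")"))
                .filter(function (s) { return s.length > 0; });
  }
};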
So is that all our parser plugin API needs to support? Well, somehow I doubt it’s that easy. Other languages have other idiosyncrasies not reflected here. If I’m remembering my high school Latin right, there are languages that decline nouns instead of (or in addition to) using prepositions — that is, the information about how each noun relates to the verb is encoded in noun suffixes, some of them irregular. And in (Mandarin) Chinese, which I’m starting to study now, the word order is very important, and there are all sorts of “counter words” and “auxiliary verbs” and other interesting constructs that I am just beginning to dimly understand.
In order to design this API and make it general enough to handle the full range of languages, without asking every localizer to start from scratch with the parser logic, I’m going to need to know a lot more than I currently do about the range of language behaviors that are out there. I’m already working on researching this, but I’d love to hear from you readers:
What does the sentence “Translate foo from bar to baz” look like in your favorite language? How does a listener know which word is the thing to be translated, which is the language to translate to, and which is the language to translate from? What sort of customizations would you need to make to the parsing algorithm in order to make it work with your language?
どうもありがとう
謝謝
Thank you!
October 1, 2008 at 3:22 am
You’re really doing some interesting thinking here. One problem I see is that inflected languages care not a whit for word order, so a boolean saying “verb is first or last” won’t help. The subject, verb, and object are all available only through declensions and conjugations. Damn right I’m going to want Ubiquity in my native Ancient Greek!
October 1, 2008 at 3:58 am
For Russian, the sentence in its most common form is probably:
Переведи foo с bar на baz
which is similar to English, but there are some alternate word orders that mean the same thing,
Переведи foo на baz с bar
being most obvious. You could also say “Translate foo to baz from bar” in English, of course.
October 1, 2008 at 4:06 am
Random thought:
Natural language seems important for the obvious reasons, but might there also be an argument to be made for *unnatural* language input? I’m thinking of cases where prior input can be used to provide context for subsequent input (e.g., smart autocompletion), which allows for faster command entry. There might be value in supporting some form of unnatural-but-fast entry. Then again, if there’s a clever way to enable natural-and-fast entry, all the better.
October 1, 2008 at 4:30 am
I have friends who work on this sort of thing academically, and they say to warn you that it gets a lot hairier than you may realize. Examples – in Hebrew, verb tense is expressed by the vowels, which are not normally written down; in Sanskrit a direct object is always the second word in the sentence, even if that puts it smack in the middle of a subclause that’s about something else; in English, the word “however” can appear between subject and verb even though its role is more like a conjunction.
The person who works most directly on this said he couldn’t immediately think of any books specifically on topic but would keep thinking about it and get back to me. Another person suggested Adele Goldberg’s Constructions for general background, and also the work of Joan Bybee.
Although various people keep pushing it, Chomsky’s theory of “Universal Grammar” has been completely and comprehensively discredited — for every rule he proposes, there exists at least one human language which does not conform. Do not rely on it; do not listen to people who advocate it.
October 1, 2008 at 5:29 am
Hmm,
Since you are using Kunrei-shiki rōmaji, I think you should make a note somewhere that は (ha) is pronounced “wa” and を (wo) is pronounced “o” when used as particles. And what if the user types some sentence final particles, ね?
Phil
October 1, 2008 at 5:30 am
Hi Dolske,
Good point. I was going to mention something about this in the article, but decided it was already too long. But since you asked…
You know how the English parser does noun-first completion when there isn’t a verb match? That is (in the trunk, not in the 0.1.1 release), you can type “tomorrow” and you’ll get a suggestion “check-calendar tomorrow”. So you can actually start with the noun, in the English parser, even though you’re “supposed” to start with the verb.
The parser isn’t a strict elementary-school grammar teacher who demands that all your sentences conform to a standard of correctness. What a pain that would be to use! For efficiency, you can skip words and abbreviate words like crazy, and the parser will do its best to figure out what you meant based on whatever input it gets. The grammar model is just a tool to help it figure out your meaning. It’s descriptive, not prescriptive.
This is how it works in English and this is how it should work in all other languages, too. What makes this work is mostly in the autocompletion and suggestion logic, which should be able to work without modification in any language.
October 1, 2008 at 5:47 am
In Portuguese (I’m from Brazil, but the difference is irrelevant to the parser), this is what you’ll get:
traduzir foo de bar para baz
Traduzir is the verb; it could be located at the end, but to command something to someone (the browser), the verb makes more sense at the start.
De is a modifier akin to “from”, just as para is akin to “to” 🙂
The word or sentence between traduzir and the modifier de is what should be translated.
Feel free to ask for more.
October 1, 2008 at 7:21 am
French is very similar to English in word order, but you have to put determiners before language names:
Traduire “foo” [du |de l’]bar [vers le|vers l’]baz
I think that you could strip the determiners entirely (le, l’), the latter having its boundary on a non-space character. And then I see at least three ways to say “from” (de, du, depuis) and two to say “to” (vers, en). Actually, since “en” only means “in”, it could also be used to mean “from”. The word order could vary, except for the verb which is always first.
Some examples:
Traduire “bonjour” du français vers l’anglais
Traduire “ciao” de l’italien vers le français
Traduire “salut” de français en anglais
Traduire “salut” en français vers l’anglais
Traduire “ciao” vers l’anglais depuis l’italien
etc.
October 1, 2008 at 7:49 am
Thanks for everyone’s feedback so far!
Zack, I’m aware that I’m probably biting off more than I can chew, but remember that we’re only dealing with commands, i.e. imperative sentences, which simplifies things considerably.
Philip: For the same reason, I’m not overly concerned about は or ね particles — not relevant to imperative sentences.
October 1, 2008 at 9:13 am
Benoit: so technically, is it possible to say:
Traduire “salut” en français en anglais
meaning: Translate “salut” in French to English?
For Polish (my native language), we decline nouns. We have a distinction between direct objects and indirect objects (similarly to French, for example). So in
Translate “hello” from English to Polish
Przetłumacz “hello” z angielskiego na polski
– “hello” is the direct object,
– “English” is an indirect object (inflected form: English = angielski, from English = z angielskiego),
– “Polish” is a second indirect object (inflected form: Polish = polski, to Polish = na polski; it *is* inflected, but the two forms happen to be identical here).
So in general you could say that for indirect objects, we use a preposition + inflected noun (declension is done by changing suffixes, although the stem changes irregularly sometimes).
The tricky part is that each verb “requires” different prepositions and cases – there is no rule to it. One preposition with a verb may mean something different with another verb. Consider this example:
Send “hello” to Anna.
Wyślij “hello” Annie OR Wyślij “hello” do Anny
As you can see, two versions are possible:
1. to Anna = Annie (no preposition, the inflected form is enough)
2. to Anna = do Anny (with preposition which “requires” a different case!)
In English, “translate” and “send” both work with “to” in order to specify the target. In Polish, “translate” requires the use of “na angielski” whereas “send” needs either “do Anny” or “Annie” – no preposition at all in the latter form.
You’re venturing into a really complicated universe, but it gives me hope that Ubiquity will be equally useful for non-English speakers. I’m not sure if this is really related, but maybe Pike’s work on l20n could somehow help here?
Great post, keep them coming!
October 1, 2008 at 10:29 am
What you need isn’t for the plugin to set a bunch of flags, but instead to do the parsing itself. Just call a function provided by the plugin that takes the current input and returns a data structure containing all the information that can be gleaned from it. Something like this, perhaps:
{ command: null,
  args: [ { type: "string", value: "今日は" },
          { type: "iso639-3", value: "eng", info: "英語に" }
        ]
}
This represents incomplete input. Note that at this point, since the command is unknown, you will want to make suggestions that are commands as well as arguments.
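To sketch the other side of that boundary (all names here are invented, not an actual Ubiquity API): the language-neutral layer would just call the plugin’s parse() and build suggestions from whatever comes back, proposing candidate commands whenever the command slot is still null.

// Hypothetical glue code on the language-neutral side of the plugin boundary.
function makeSuggestions(plugin, input, allCommands) {
  var parsed = plugin.parse(input);  // a structure shaped like the one above
  if (parsed.command !== null) {
    return [parsed];  // verb identified: a single candidate parsing
  }
  // Verb still unknown (e.g. Japanese, where it arrives last): suggest every
  // command that could accept the argument types seen so far.
  return allCommands
    .filter(function (cmd) {
      return parsed.args.every(function (arg) {
        return cmd.argTypes.indexOf(arg.type) !== -1;
      });
    })
    .map(function (cmd) {
      return { command: cmd.name, args: parsed.args };
    });
}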
Also, you may want to take a look at trueknowledge.com to see how they are designing their natural-language interface. Email me if you want an invite.
October 1, 2008 at 12:35 pm
Note that in Russian the nouns must be declined, that is, in «Переведи foo с bar на baz» «bar» would be in genitive and «baz» would be in accusative («Переведи „Hello“ с английского на французский», not «Переведи „Hello“ с английский на французский»), so you’ll have to apply stemming.
October 1, 2008 at 2:20 pm
Dutch is simple:
Vertaal foo van bar naar baz
October 1, 2008 at 2:21 pm
In Dutch, the sentence translates nicely into:
“Vertaal foo van het bar naar het baz”, for instance:
“Vertaal ‘hallo’ van het Nederlands naar het Engels”.
‘van’ can be replaced by ‘uit’ or ‘vanuit’, and ‘het’ can be left out. Also, the subject and object can be swapped, so the next sentence is also fully correct (albeit sounding a bit strange):
“Vertaal ‘hallo’ naar Engels vanuit het Nederlands”.
October 1, 2008 at 7:29 pm
I agree with Daniel on the general engineering principle — you don’t want properties for your language, you want a parsing function. You should provide handy functions the parsing function can use, so that perhaps languages that are only slightly different than English can use the English parsing subroutines with slightly different arguments. This gives each language the flexibility to parse in as complex a way as it wants, while not being any more difficult to implement.
Probably the more complex situation is when the parsing is ambiguous. I imagine that this will happen a lot when the parsing is incremental, and the user hasn’t yet entered a complete statement. Returning an object with missing pieces represents one kind of ambiguity (like in Japanese where the verb comes last). Maybe returning a list of possible parsings would also make sense, if for instance there was ambiguity about what the direct object is. This isn’t too bad for Ubiquity, as you can just present the user with all the possible actions from all the possible parsings.
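A rough sketch of that idea, using the ambiguous French example from earlier in the thread (everything here is hypothetical, of course): the plugin returns every plausible reading, and Ubiquity offers one suggestion per reading.

// Hypothetical: parsing the ambiguous input
//   traduire "salut" en français en anglais
// Both nouns are marked with "en", so the plugin returns both readings and
// lets the suggestion list present them side by side.
function parseFrenchExample() {
  return [
    // Reading 1: "salut" is French, translate it into English.
    { command: "traduire",
      args: [ { type: "string", value: "salut" },
              { type: "language", role: "from", value: "français" },
              { type: "language", role: "to", value: "anglais" } ] },
    // Reading 2: "salut" is English, translate it into French.
    { command: "traduire",
      args: [ { type: "string", value: "salut" },
              { type: "language", role: "from", value: "anglais" },
              { type: "language", role: "to", value: "français" } ] }
  ];
}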
October 1, 2008 at 9:52 pm
Staś: well, technically nobody would say that. Or maybe with a special intonation, and probably with a long pause and “mais” (=but) to introduce the second clause, like this:
“Comment on dit la même chose que ‘salut’ en français, mais en anglais ?”.
Now your example is indeed ambiguous, and to resolve it you’d have to know in which language the string was in the first place. If you know that “salut” is in French, it becomes redundant. So what is left is:
Traduire “salut” en anglais.
I don’t know if Ubiquity could guess the language of the query to resolve such ambiguities, but I’m not sure it’s worth the trouble anyway. People will learn by themselves to use the right syntactic form if they see the result is not what they expected. I think I saw in some demos that Ubiquity reformulates the query in terms it understands and shows it to the user, so that’s easy to check and remember.
October 2, 2008 at 9:07 am
Italian constructs sentences almost like English, but there are some complications.
For example,
translate ciao from italian to english
becomes
traduci ciao dall’italiano all’inglese
Prepositions behave differently depending on the first letter of the next word. Here both words start with a vowel, so we use an apostrophe with no space.
traduci ciao dall’italiano al cinese
If the word after the preposition starts with a consonant, we don’t use an apostrophe but a space.
The “it” word could be another problem. In Italian we could write
traduci questo
where “questo” is “this”, but it sounds definitely better to write
traducilo
where the -lo suffix stands for “it”.
Obviously we could settle for a less natural, but simpler-to-implement, solution.
October 2, 2008 at 9:16 am
In Spanish you say
traducir foo del español al ingles
From and to are inverted
Translate this from spanish to english
9 out of 10 times I have to translate a word from English to Spanish. It would be very useful to have two predefined (and configurable) suggestions:
translate foo
suggestions:
translate foo from english to spanish
translate foo from spanish to english
PS: Let me know if you need a volunteer for the Spanish localization.
October 2, 2008 at 12:55 pm
Rightyho, in Finnish you would have:
“Käännä foo bar*sta baz*ksi”, where * is some vowel, depending on the value for bar and baz.
But you also need to take into account that both bar and baz may be conjugated in different ways, depending on the stems of the words.
For instance, Finnish is “suomi” and Swedish is “ruotsi”. But “from Finnish” is “suomEsta” whereas “from Swedish” is “ruotsista”, since “suom-” is the stem in suomi and therefore the final i in “suomi” is conjugated away. And this is only the tip of the iceberg regarding Finnish grammar. *sigh*
In Swedish, though, it’s extremely straightforward:
“Översätt foo från bar till baz”.
October 2, 2008 at 11:03 pm
I suspect that Daniel and Ian are right that you should let people plug in whole parsers, at least in the beginning. Later, you might be able to derive an API that simplifies the writing of a parser (perhaps even to the point where a person can write one declaratively) by factoring out similarities between existing parsers. But trying to do that up front seems like it’ll be much more painful.
For Hungarian, a common version of the sentence is:
Forditsd le barról bazra azt, hogy “foo”.
For example:
Forditsd le Angolról Japánra azt, hogy “hello”.
Where “forditsd” is the informal imperative definite conjugation of “translate”, “le” is a verb modifier (literally “down”), the suffix “ról” is the back-vowel declension of “off of”, the suffix “ra” is the back-vowel declension of “onto”, “azt” is the direct object declension of “that”, and “hogy” (“that”) identifies “hello” as the thing being represented by “azt”.
Hungarians may well simplify this when talking to a computer, however, e.g. by dropping “azt, hogy”. They might also move words around, as word order rules are laxer, and variation in word order more frequent, in Hungarian than in English.
October 4, 2008 at 3:12 am
One issue that will have to be addressed is right-to-left support for Hebrew and Arabic. These languages also introduce a segmentation problem: they have case marking via prefixes and suffixes (rather than separate prepositions), and because vowels are left out, there is ambiguity as to whether something is a prefix/suffix or not.
To give a Hebrew example: I believe “translate ‘hello’ from Hebrew to English” would be:
לתרגם “שלום” מעברית לאנגלית
Represented in Roman characters (without vowels):
ltrgm “$lwm” m&bryt l@nglyt
Modern Hebrew, like English, is SVO. The m- in the third word and the l- in the fourth word are prefixes meaning “from” and “to”, respectively. The first word is the infinitival form “to translate” which can function as an imperative, though there is also a second-person form that can be used as well. Thanks to these prefixes, the three arguments can be listed in any order. There is also a b- prefix meaning “in” (or sometimes “with”), and a k- prefix meaning “as”. (However, there are plenty of words starting with ‘l’, ‘m’, ‘b’, or ‘k’ where they aren’t prefixes.) Prepositions for other relations function as independent words.
Another feature of Hebrew is that definite direct objects are preceded by a special particle, @t. I think most indirect objects are marked with the l- prefix, though, so it may suffice to ignore the @t.
I know less about Arabic, but I believe that Arabic has case-marking suffixes as well as prefixes, and that some of these are optional.
(P.S. I’d like to echo Zack’s shout-out to Goldberg and Bybee! Though of course Ubiquity will be handling much simpler syntax than a typical sentence in the language; presumably basic facts about word order, case-marking/prepositions/postpositions, and segmentation are all that’s necessary to establish the relationship between verbs and arguments. If commands could use multiple verbs at once, or modifiers akin to adjectives/adverbs in English, the task would probably be much harder.)
October 9, 2008 at 6:32 pm
[…] other things, 0.1.2 contains a preliminary version of the parser-localization API we discussed in my previous post. I took the advice of my (overwhelmingly helpful!) commenters and, instead of trying to factor out […]
October 29, 2008 at 2:27 am
There are a number of tools out there that will do morphological analysis of Japanese sentences for you. A really good one that I’ve used is MeCab (mecab.sourceforge.net). It’s fast, pretty accurate and has good bindings for Java, Python and a few other languages.
Afraid I can’t help you with any other languages though!
February 23, 2009 at 7:44 pm
[…] talked about similar localization issues several months ago, but Mitcho takes it further than I did. He says: In a verb-final language, […]
June 25, 2009 at 12:52 am
[…] But the old API was holding us back: it wasn’t extensible enough, it couldn’t support localization, and it was getting in the way of defining sane and consistent naming standards. Since our original […]
July 31, 2009 at 12:18 am
I’m Icelandic and really late 😉
In Icelandic, “You, translate ‘foo’ from Bar to Baz” would be:
,,Þýddu ,Foo’ úr Bar í Baz”.
where:
‘Þýddu’ means you shall translate. (We usually add a suffix to tell who is supposed to do it. It’s not needed, but we almost always include it.)
and ‘úr’ means ‘out-of’ and ‘í’ means ‘into’.
Similar to English, except we have some cases, which should all be accepted.
For example one could use [að] ‘þýða’ (formally ‘to “translate”’, or someone should translate), ‘þýð’ (someone */shall/* translate (command)), ‘þýðið’ (though that’s actually plural) or ‘þýtt’ (translated, or translate as in ‘could you translate’). In some cases the interpreter should make a distinction; in others it should ignore the exact form.