GitHub - wmrowan/CS124-hw7

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
node_modules/underscore		node_modules/underscore
stanford-postagger		stanford-postagger
README		README
dictionary-scraper.js		dictionary-scraper.js
dictionary-translate.js		dictionary-translate.js
dictionary.json		dictionary.json
english-tagged.txt		english-tagged.txt
english.txt		english.txt
french.txt		french.txt
reorder.py		reorder.py
spanish.txt		spanish.txt
translate.sh		translate.sh

Repository files navigation

CS124 -- Programming Assignment 7 -- Machine Translation

Kylie Poppen ([email protected])
Keegan Poppen ([email protected])
Bill Rowan ([email protected])

For this assignment, our group worked to write a translator from French to English. French and English are both Latin-based languages with Germanic influence, which simplifies many complicating factors of translation. Similar to English, French sentences tend to take the form of Verb-Subject-Object. In terms of translation, the most complicating factors were word ordering, prepositional differences, and complex negation. In terms of word order, adjectives, adverbs, and pronouns are typically syntactically different than english organization. English also has greater differentiation between prepositions, for example, the French word 'a' can mean to, toward, or of in English, along with a few other colloquial translations. The preposition, then, is translated based off of the surrounding context. Therefore, unigram translation will present an issue. The French language also has a more complex form of negation. As opposed to English, where negation is simply 'not' + verb, French negation follows the form 'ne' + verb + 'pas'. French also has more prevalent use of articles and has more verb tenses than English (particularly the subjunctive tenses). Luckily, the unigram translation should not be affected by extra tenses.

Original Test Document (excerpt from Chapter One of Le Petit Prince)
------------------------
Les grandes personnes m'ont conseillé de laisser de côté les dessins de serpents boas ouverts ou fermés, et de m'intéresser plutôt à la géographie, à l'histoire, au calcul et à la grammaire. C'est ainsi que j'ai abandonné, à l'âge de six ans, une magnifique carrière de peintre. J'avais été découragé par l'insuccès de mon dessin numéro 1 et de mon dessin numéro 2. Les grandes personnes ne comprennent jamais rien toutes seules, et c'est fatigant, pour les enfants, de toujours leur donner des explications.
J'ai donc dû choisir un autre métier et j'ai appris à piloter des avions. J'ai volé un peu partout dans le monde. Et la géographie, c'est exact, m'a beaucoup servi. Je savais reconnaître, du premier coup d'œil, la Chine de l'Arizona. C'est très utile, si l'on est égaré pendant la nuit.
J'ai ainsi eu, au cours de ma vie, des tas de contacts avec des tas de gens sérieux. J'ai beaucoup vécu chez les grandes personnes. Je les ai vues de très près. Ça n'a pas trop amélioré mon opinion.
Quand j'en rencontrais une qui me paraissait un peu lucide, je faisais l'expérience sur elle de mon dessin n° 1 que j'ai toujours conservé. Je voulais savoir si elle était vraiment compréhensive. Mais toujours elle me répondait: "C'est un chapeau." Alors je ne lui parlais ni de serpents boas, ni de forêts vierges, ni d'étoiles. Je me mettais à sa portée. Je lui parlais de bridge, de golf, de politique et de cravates. Et la grande personne était bien contente de connaître un homme aussi raisonnable.

Our System's Output
------------------------

The large people have I advisable to let side the drawings of snakes boas open or closed and of interest me rather to geography, to history, at calculation and grammar. This is so that I abandoned, to age of six years a magnificent career of painter. I discouraged summer by the failure of my drawing number A and of my drawing number 2. The large people include nothing never all only , and this is tiring , for the children , of always their give explanations. I therefore choose due an other job and I learned to pilot aircraft. I stolen a little everywhere in the world. And the geography, this is accurate, me served much. I knew to recognize, of first blow Look, the China of Arizona. This is very useful , if is is lost for the night. I had so, at course of my life, of heap of Contacts with of heap of people seriously. I lived much in the large persons. I the have views of very close. It has not too improved my opinion. When I met a which me seemed a little lucid , I did experience on it of my drawing No A that I retained always. I wanted to know if it was understanding really. But always it answered me : "It 's a hat." Then I not talking him of snakes boas, or virgin forests or stars. I put me to its reach. I talking him of bridge of golf , of policy and of ties. And the great person was well content to know a man also reasonable.

Our Reordering Algorithm
------------------------

To reorder the output of the unigram translation algorithm in order to
produce better constructed English. We successively apply a number of simple
rules that reorder, change, and sometimes add or subtract words.

The first of these is called AdjectiveBeforeNoun and simply swaps any noun
adjective pairs where the adjective follows the noun. This is done iteratively
within the sentence so that each noun will end up being pushed to the back of
any sequence of consecutive adjectives that follow it.

The intuition for this is simple. In English adjectives always proceed the noun
they modify while in French they often follow.

The next is called AdverbAfterVerb. This swaps any adverb verb pair so that the
adverb follows the verb. Based on the original unigram output we discovered
that while English adverbs may appear almost anywhere in the sentence, they
often sat awkwardly right in front of the verb in our unigram translation.

Even more errors seem to come out of direct object placement. In the French
source, the direct objects frequently came before the verb so that we ended up
with a subject-object-verb pattern. English, of course, always follows a
subject-verb-object pattern. Unfortunately, it was difficult to come up with a
rule that could fix this problem consistently. Because our direct objects were
marked the same as the subject by the POS tagger, it is almost impossible to
tell if a given noun is the direct object or part of the subject. The best
heuristic we were able to come up with involved looking for a string of nouns
followed by a verb. This is the basis of the DOAfterVerbs rule which switches
the last noun and the verb in these cases, assuming that the noun preceding
the verb is the direct object.

Next, we note that English verbs require a "to" helper word when in their
infinitive form. The unigram model translation model is unable, by itself to
determine when this is necessary. Therefore, we have another rule
InfinitiveVerbs that inserts the "to" whenever we see a preposition followed by
a verb or two verbs in a row.

English's indefinite article 'a' has a weird quirk to it that the word for word
translation wouldn't be able to pick up on without any context. That is, when
the noun being modified begins with a vowel, the proceeding indefinite article
changes to an 'an'. This is the purpose of the AAn rule which was actually
quite straightforward to implement as all the necessary information is readily
available.

We also found a number of important linguistic differences related to negation.

The NotSwap rule simply reverse the two words that follow a 'not'. In French
the words that follow a 'not' are by rule reversed relative to the proper
English word ordering. This is a relatively simple rule to implement.

The DoubleNeg rule applies to another interesting quirk of the French treatment
of negation. That is, French makes heavy use of double negation where English
would have one negative. In fact, the double reversal that applies in English
doesn't seem to apply at all in French so two consequtive negatives still has
the same meaning as a single negation. To fix this we compiled a set of
negating English words and look for sequences of them, removing any extras.

Many of the translation errors we encountered involved improper translation and
placement of prepositions. The original French text seems to have many more
prepositions than the proper English translation. Furthermore, two French
prepositions a/au and de can be translated to any one of a number of English
prepositions. In practice, de always get's translated to "of" even though the
proper English translation depends heavily upon context. We developed a few
rules to try and deal with this problem though we admit they are far from
perfect. We just didn't have very much information to work with.

The OrOf rule tries to find list of nouns and then removes the first 'or' and
subsequent 'of's from the list. This applies for situations where there is a
potentially long list of disjunctions that becomes translated directly as a
series of "or of" phrases. English only requires the 'of' on the first element
of the list and 'or's between subsequent elements.

The ToThe rule applies in a very similar context. Just as we found there were
many extraneous 'of's in the previous rule, we also found many extraneous 'to's
in a similar list context. French also seems to include more definite articles
in contexts that they are not needed in English. Therefore we also looked for
lists of "to the"s and removed all the 'the's and all 'to's after the first.

The OfStrip rule is similar in spirit to the previous two. In contexts where
English would indicate noun modification with simple word order French will
often include an extra 'of' that is not needed in English. It was tough to
describe specific contexts where this was an issue but one we were definitely
able nail it down was a V + 'of' + N sequence. In this case we simply remove
the 'of' in the middle.

Error Analysis
--------------

Our translation is far from perfect. In fact, it doesn't really make any sense
at all. It is so bad, in fact, that several people who we asked who were very
familiar with the Little Prince were unable to recognize it on the basis of our
translation.

Why is this? Are we just really bad computer scientists? The process of
translating the original French document has several parts.

1. Translate each word individually using the most common translation for that
word. For this step we asked Google translate to translate each word one at
a time. Since it had not the ability to infer the translation from context
it in essence was doing a simple dictionary lookup and returning the most
likely candidate. This, in the end, produced very bad output, in many cases
returning not even a translation of the correct part of speech.

2. Tag the resulting text with the Stanford POS tagger. This process we also
could not control. Especially because it had to work on completely
nonsensical input the POS tagging ended up being hopelessly wrong,
especially in biasing towards nouns in cases where the word would more
correctly be tagged as a verb if only the initial translation had been
better and the resulting text obeyed a more correct English word ordering.
Of course, this is a chicken and the egg problem, the sentence could not
correctly be reordered until the POS information was there but the POS
tagger would get confused unless proper English word ordering rules were
being applied.

3. Reorder the sentence given the available information. Only now do we have
any chance to influence the process. The assignment passes of this task as a
simple matter of enforcing certain English grammar rules when they differ
from the equivalent French yet there is nothing we can do when the raw
material we have to work with is so bad. If we're given incorrect
translations for the words and those words are given bad POS tags then we
can't possibly reorder the bad words based on bad tags to come up with the
correct sentence. The best we can do is put lipstick on the pig, with a
blindfold on, through a robotic arm that overshoots its instructed movements
and provides no tactile feedback.

A good example has to do with direct objects and indirect objects. In
English we often use word order and prepositions to distinguish these. To
correctly infer indirect objects from the original French we would ideally
have a parse tree and precises preposition information. But since all French
prepositions were translated as 'of' and all nouns whether subject or object
in the original were simply labeled as 'NN' all the information that would
need had already been thrown away in previous steps.

To actually fix the errors in our translation we would need a completely
different approach that involved access to more powerful tools. To start off
with we would need a real parser, not just a POS tagger, for the original
French. This would help us know which translation to select for words that
could be either a noun or a verb in different contexts. The word 'pilot' is a
good example from our test document. This also applies to prepositions too.
Which could have very different translations to English equivalents depending
on context in the original.

This also assumes that we have a full dictionary that includes translations for
multiple different senses of the word. We need more freedom to select the
correct translation of any given word. As it is, we may end up with sentences
full of nouns and no verbs. There is again nothing that pure reordering can do
in that case. Once we done intelligent word for word translation we should then
output this initial translation with the original French POS information
attached. The French POS information is far more useful than fake English POS
inferences derived from bad data.

With this more accurate and richer starting point our existing approach to
sentence reordering would have a much better shot at succeeding.

Google Translate Output
------------------------
Large people advised me to leave out the drawings of boa constrictors open or closed, and devote myself instead to geography, history, arithmetic and grammar. Thus I gave up, at the age of six years, a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number 1 and number 2 of my drawing. The grown-ups never understand anything by themselves, and it is tiresome for children, always explaining things to them.

So I had to choose another profession, and I learned to pilot airplanes. I flew around the world. And geography, that's right, has served me. I can distinguish the first glance, China from Arizona. This is very useful if one is lost in the night.

I have had during my life, lots of contacts with lots of serious people. I have lived much among adults. I saw them very closely. It has not much improved my opinion.

Whenever I met one who seemed a bit lucid, I made the experiment of showing him my Drawing No. 1 I have always kept. I wanted to know if she was really understanding. But still she replied,

"That is a hat."

Then I would never speak or boa constrictors, or primeval forests, or stars. I was down to his level. I told him about bridge, golf, politics, and neckties. And large would be greatly pleased to know such a sensible man.

Comparative Analysis
------------------------

The Google translate output is obviously much better than ours. Still, it made
some of the same mistakes that we made.

Because the Google approach relies on a statistical analysis though it is prone
to messing up simple rules that would be better to be hard coded into the
system in the manner of ours. In the last sentence above, for example, while we
both correctly drop subsequent 'of's in the given list Google translate makes
the mistake of dropping the first 'of' and not dropping the first 'or'.

Where Google really shines is its ability to infer from the context surrounding
a word what the correct choice for a translation of that word should be.

Group Responsibilities
------------------------

We all worked together on the translation rules and the writeup.