-
Notifications
You must be signed in to change notification settings - Fork 0
wmrowan/CS124-hw7
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
CS124 -- Programming Assignment 7 -- Machine Translation Kylie Poppen ([email protected]) Keegan Poppen ([email protected]) Bill Rowan ([email protected]) For this assignment, our group worked to write a translator from French to English. French and English are both Latin-based languages with Germanic influence, which simplifies many complicating factors of translation. Similar to English, French sentences tend to take the form of Verb-Subject-Object. In terms of translation, the most complicating factors were word ordering, prepositional differences, and complex negation. In terms of word order, adjectives, adverbs, and pronouns are typically syntactically different than english organization. English also has greater differentiation between prepositions, for example, the French word 'a' can mean to, toward, or of in English, along with a few other colloquial translations. The preposition, then, is translated based off of the surrounding context. Therefore, unigram translation will present an issue. The French language also has a more complex form of negation. As opposed to English, where negation is simply 'not' + verb, French negation follows the form 'ne' + verb + 'pas'. French also has more prevalent use of articles and has more verb tenses than English (particularly the subjunctive tenses). Luckily, the unigram translation should not be affected by extra tenses. Original Test Document (excerpt from Chapter One of Le Petit Prince) ------------------------ Les grandes personnes m'ont conseillé de laisser de côté les dessins de serpents boas ouverts ou fermés, et de m'intéresser plutôt à la géographie, à l'histoire, au calcul et à la grammaire. C'est ainsi que j'ai abandonné, à l'âge de six ans, une magnifique carrière de peintre. J'avais été découragé par l'insuccès de mon dessin numéro 1 et de mon dessin numéro 2. Les grandes personnes ne comprennent jamais rien toutes seules, et c'est fatigant, pour les enfants, de toujours leur donner des explications. J'ai donc dû choisir un autre métier et j'ai appris à piloter des avions. J'ai volé un peu partout dans le monde. Et la géographie, c'est exact, m'a beaucoup servi. Je savais reconnaître, du premier coup d'œil, la Chine de l'Arizona. C'est très utile, si l'on est égaré pendant la nuit. J'ai ainsi eu, au cours de ma vie, des tas de contacts avec des tas de gens sérieux. J'ai beaucoup vécu chez les grandes personnes. Je les ai vues de très près. Ça n'a pas trop amélioré mon opinion. Quand j'en rencontrais une qui me paraissait un peu lucide, je faisais l'expérience sur elle de mon dessin n° 1 que j'ai toujours conservé. Je voulais savoir si elle était vraiment compréhensive. Mais toujours elle me répondait: "C'est un chapeau." Alors je ne lui parlais ni de serpents boas, ni de forêts vierges, ni d'étoiles. Je me mettais à sa portée. Je lui parlais de bridge, de golf, de politique et de cravates. Et la grande personne était bien contente de connaître un homme aussi raisonnable. Our System's Output ------------------------ The large people have I advisable to let side the drawings of snakes boas open or closed and of interest me rather to geography, to history, at calculation and grammar. This is so that I abandoned, to age of six years a magnificent career of painter. I discouraged summer by the failure of my drawing number A and of my drawing number 2. The large people include nothing never all only , and this is tiring , for the children , of always their give explanations. I therefore choose due an other job and I learned to pilot aircraft. I stolen a little everywhere in the world. And the geography, this is accurate, me served much. I knew to recognize, of first blow Look, the China of Arizona. This is very useful , if is is lost for the night. I had so, at course of my life, of heap of Contacts with of heap of people seriously. I lived much in the large persons. I the have views of very close. It has not too improved my opinion. When I met a which me seemed a little lucid , I did experience on it of my drawing No A that I retained always. I wanted to know if it was understanding really. But always it answered me : "It 's a hat." Then I not talking him of snakes boas, or virgin forests or stars. I put me to its reach. I talking him of bridge of golf , of policy and of ties. And the great person was well content to know a man also reasonable. Our Reordering Algorithm ------------------------ To reorder the output of the unigram translation algorithm in order to produce better constructed English. We successively apply a number of simple rules that reorder, change, and sometimes add or subtract words. The first of these is called AdjectiveBeforeNoun and simply swaps any noun adjective pairs where the adjective follows the noun. This is done iteratively within the sentence so that each noun will end up being pushed to the back of any sequence of consecutive adjectives that follow it. The intuition for this is simple. In English adjectives always proceed the noun they modify while in French they often follow. The next is called AdverbAfterVerb. This swaps any adverb verb pair so that the adverb follows the verb. Based on the original unigram output we discovered that while English adverbs may appear almost anywhere in the sentence, they often sat awkwardly right in front of the verb in our unigram translation. Even more errors seem to come out of direct object placement. In the French source, the direct objects frequently came before the verb so that we ended up with a subject-object-verb pattern. English, of course, always follows a subject-verb-object pattern. Unfortunately, it was difficult to come up with a rule that could fix this problem consistently. Because our direct objects were marked the same as the subject by the POS tagger, it is almost impossible to tell if a given noun is the direct object or part of the subject. The best heuristic we were able to come up with involved looking for a string of nouns followed by a verb. This is the basis of the DOAfterVerbs rule which switches the last noun and the verb in these cases, assuming that the noun preceding the verb is the direct object. Next, we note that English verbs require a "to" helper word when in their infinitive form. The unigram model translation model is unable, by itself to determine when this is necessary. Therefore, we have another rule InfinitiveVerbs that inserts the "to" whenever we see a preposition followed by a verb or two verbs in a row. English's indefinite article 'a' has a weird quirk to it that the word for word translation wouldn't be able to pick up on without any context. That is, when the noun being modified begins with a vowel, the proceeding indefinite article changes to an 'an'. This is the purpose of the AAn rule which was actually quite straightforward to implement as all the necessary information is readily available. We also found a number of important linguistic differences related to negation. The NotSwap rule simply reverse the two words that follow a 'not'. In French the words that follow a 'not' are by rule reversed relative to the proper English word ordering. This is a relatively simple rule to implement. The DoubleNeg rule applies to another interesting quirk of the French treatment of negation. That is, French makes heavy use of double negation where English would have one negative. In fact, the double reversal that applies in English doesn't seem to apply at all in French so two consequtive negatives still has the same meaning as a single negation. To fix this we compiled a set of negating English words and look for sequences of them, removing any extras. Many of the translation errors we encountered involved improper translation and placement of prepositions. The original French text seems to have many more prepositions than the proper English translation. Furthermore, two French prepositions a/au and de can be translated to any one of a number of English prepositions. In practice, de always get's translated to "of" even though the proper English translation depends heavily upon context. We developed a few rules to try and deal with this problem though we admit they are far from perfect. We just didn't have very much information to work with. The OrOf rule tries to find list of nouns and then removes the first 'or' and subsequent 'of's from the list. This applies for situations where there is a potentially long list of disjunctions that becomes translated directly as a series of "or of" phrases. English only requires the 'of' on the first element of the list and 'or's between subsequent elements. The ToThe rule applies in a very similar context. Just as we found there were many extraneous 'of's in the previous rule, we also found many extraneous 'to's in a similar list context. French also seems to include more definite articles in contexts that they are not needed in English. Therefore we also looked for lists of "to the"s and removed all the 'the's and all 'to's after the first. The OfStrip rule is similar in spirit to the previous two. In contexts where English would indicate noun modification with simple word order French will often include an extra 'of' that is not needed in English. It was tough to describe specific contexts where this was an issue but one we were definitely able nail it down was a V + 'of' + N sequence. In this case we simply remove the 'of' in the middle. Error Analysis -------------- Our translation is far from perfect. In fact, it doesn't really make any sense at all. It is so bad, in fact, that several people who we asked who were very familiar with the Little Prince were unable to recognize it on the basis of our translation. Why is this? Are we just really bad computer scientists? The process of translating the original French document has several parts. 1. Translate each word individually using the most common translation for that word. For this step we asked Google translate to translate each word one at a time. Since it had not the ability to infer the translation from context it in essence was doing a simple dictionary lookup and returning the most likely candidate. This, in the end, produced very bad output, in many cases returning not even a translation of the correct part of speech. 2. Tag the resulting text with the Stanford POS tagger. This process we also could not control. Especially because it had to work on completely nonsensical input the POS tagging ended up being hopelessly wrong, especially in biasing towards nouns in cases where the word would more correctly be tagged as a verb if only the initial translation had been better and the resulting text obeyed a more correct English word ordering. Of course, this is a chicken and the egg problem, the sentence could not correctly be reordered until the POS information was there but the POS tagger would get confused unless proper English word ordering rules were being applied. 3. Reorder the sentence given the available information. Only now do we have any chance to influence the process. The assignment passes of this task as a simple matter of enforcing certain English grammar rules when they differ from the equivalent French yet there is nothing we can do when the raw material we have to work with is so bad. If we're given incorrect translations for the words and those words are given bad POS tags then we can't possibly reorder the bad words based on bad tags to come up with the correct sentence. The best we can do is put lipstick on the pig, with a blindfold on, through a robotic arm that overshoots its instructed movements and provides no tactile feedback. A good example has to do with direct objects and indirect objects. In English we often use word order and prepositions to distinguish these. To correctly infer indirect objects from the original French we would ideally have a parse tree and precises preposition information. But since all French prepositions were translated as 'of' and all nouns whether subject or object in the original were simply labeled as 'NN' all the information that would need had already been thrown away in previous steps. To actually fix the errors in our translation we would need a completely different approach that involved access to more powerful tools. To start off with we would need a real parser, not just a POS tagger, for the original French. This would help us know which translation to select for words that could be either a noun or a verb in different contexts. The word 'pilot' is a good example from our test document. This also applies to prepositions too. Which could have very different translations to English equivalents depending on context in the original. This also assumes that we have a full dictionary that includes translations for multiple different senses of the word. We need more freedom to select the correct translation of any given word. As it is, we may end up with sentences full of nouns and no verbs. There is again nothing that pure reordering can do in that case. Once we done intelligent word for word translation we should then output this initial translation with the original French POS information attached. The French POS information is far more useful than fake English POS inferences derived from bad data. With this more accurate and richer starting point our existing approach to sentence reordering would have a much better shot at succeeding. Google Translate Output ------------------------ Large people advised me to leave out the drawings of boa constrictors open or closed, and devote myself instead to geography, history, arithmetic and grammar. Thus I gave up, at the age of six years, a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number 1 and number 2 of my drawing. The grown-ups never understand anything by themselves, and it is tiresome for children, always explaining things to them. So I had to choose another profession, and I learned to pilot airplanes. I flew around the world. And geography, that's right, has served me. I can distinguish the first glance, China from Arizona. This is very useful if one is lost in the night. I have had during my life, lots of contacts with lots of serious people. I have lived much among adults. I saw them very closely. It has not much improved my opinion. Whenever I met one who seemed a bit lucid, I made the experiment of showing him my Drawing No. 1 I have always kept. I wanted to know if she was really understanding. But still she replied, "That is a hat." Then I would never speak or boa constrictors, or primeval forests, or stars. I was down to his level. I told him about bridge, golf, politics, and neckties. And large would be greatly pleased to know such a sensible man. Comparative Analysis ------------------------ The Google translate output is obviously much better than ours. Still, it made some of the same mistakes that we made. Because the Google approach relies on a statistical analysis though it is prone to messing up simple rules that would be better to be hard coded into the system in the manner of ours. In the last sentence above, for example, while we both correctly drop subsequent 'of's in the given list Google translate makes the mistake of dropping the first 'of' and not dropping the first 'or'. Where Google really shines is its ability to infer from the context surrounding a word what the correct choice for a translation of that word should be. Group Responsibilities ------------------------ We all worked together on the translation rules and the writeup.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published