Lab 3: Pig Latin Translator

COMP 225

Programming Assignment 3

Pig Latin Translator

Due on Monday, March 11, at 11:45 pm.

Goals

Learn how to use regular expressions and substitution.
Learn how to "translate" a web page.

Overview

In lecture, we discussed the Ubby-Dubby "language" and how to use the Perl substitution operator to translate from English into Ubby-Dubby. A similar "language" is Pig Latin. Various Pig Latin dialects exist. We shall use the following simple rules.

If the first letter of the word is a vowel ("a", "e", "u", "i", or "o"), then append "way" to the end of the word.
Otherwise:

Move all letters before the first usable vowel to the end of the word. A usable vowel is:

any "a", "e", "u", "i", or "o"
any "y" that is not the first character of the word

Add "ay" to the end of the word

(These rules are not perfect, but they are simple.)

Here are some examples:

English	Pig Latin
some	omesay
ant	antway
jdk	jdkay
great	eatgray
yacht	achtyay
kybosh	yboshkay
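As a reading aid, the rules above can be sketched as a small Perl function. This is only an illustration under stated assumptions: the name pig_latin_word is hypothetical, the input is assumed to be a single word made of letters, and the sketch deliberately uses an if statement rather than the substitution-only style required for translator.pl. It ignores capitalization.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): translate one word
# according to the rules above, without touching letter case.
sub pig_latin_word {
    my ($word) = @_;

    # Rule 1: a word that starts with a vowel just gets "way" appended.
    return $word . "way" if $word =~ /^[aeiouAEIOU]/;

    # Rule 2: take the first character plus any following characters that
    # are not usable vowels ("y" is usable once it is not the first
    # character), move that prefix to the end, and add "ay".
    if ($word =~ /^(.[^aeiouyAEIOUY]*)(.*)$/) {
        return $2 . $1 . "ay";
    }
    return $word . "ay";    # only reached for an empty string
}

# Print the words from the example table.
for my $w (qw(some ant jdk great yacht kybosh)) {
    print "$w\t", pig_latin_word($w), "\n";
}
```

A word with no usable vowel, such as jdk, is matched entirely by the first group, so nothing is moved and only "ay" is appended.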

Your Task

We ask you to write two English-to-Pig Latin translators.

The first translator (please call it translator.pl) should translate plain text files. Every word in the input file should be translated according to the rules described above, and the output should be printed to standard output. We ask you to do the translation with a series of substitution operators, similar to the way I wrote the Ubby-Dubby translator. You must use substitution operators and nothing else to receive full credit for this part. I managed with just two substitution operators (plus the two capitalization operators described below).

You should also handle words with upper-case letters correctly. See below for hints on how to do that.

Capitalization. Handling capitalization during translation would be quite tricky. What I suggest instead is to handle capitalization as post-processing. First, transform English words to Pig Latin without changing the case of any characters. After the English-to-Pig Latin translation, apply the following two substitution operators to the result:

    s/\b([a-z]+)([A-Z])([a-z]+)\b/\u$1\l$2$3/g;

    s/\b([A-Z]{2,})([a-z]{2,3})\b/\U$1$2\E/g;

The first operator takes a word that has a single upper-case letter in the middle and replaces it with the same word in which only the first letter is upper case. For example, the English word They would first be transformed by one of your substitution operators into eyThay and then, by this substitution, into Eythay.

The second operator converts a word that consists of several upper-case letters followed by 2 or 3 lower-case letters to all upper case. For example, USA would first become USAway, which this substitution would change to USAWAY.

This should take care of about 99.9% of cases, and that is good enough.
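To see the two capitalization operators at work, here is a small sketch that applies them to a few already-translated words. The wrapper name fix_case and the sample string are assumptions made for the illustration; the two substitutions are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: apply the two post-processing substitutions
# to text that has already been translated to Pig Latin.
sub fix_case {
    my ($text) = @_;

    # eyThay -> Eythay: upper-case the first letter, lower-case the
    # single capital that ended up in the middle of the word.
    $text =~ s/\b([a-z]+)([A-Z])([a-z]+)\b/\u$1\l$2$3/g;

    # USAway -> USAWAY: a run of capitals followed by a 2- or 3-letter
    # lower-case suffix becomes all upper case.
    $text =~ s/\b([A-Z]{2,})([a-z]{2,3})\b/\U$1$2\E/g;

    return $text;
}

print fix_case("eyThay atThay USAway JDKay"), "\n";
# prints: Eythay Atthay USAWAY JDKAY
```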

Of course, you may use any other method if you prefer. But as the end result you should:

Transform a word in which only the first letter is upper case into a word in which only the first letter is upper case. For example, That should be translated into Atthay, not atThay.
Transform all-caps words into all-caps words. For example, JDK should become JDKAY, not JDKay.
Other cases, such as McDonald, may be handled in any way you like.

Specifying the input file name. Ideally, the file name would be a command-line argument of your script, but that would make running your script from UltraEdit somewhat more awkward. So you may set the file name inside your script. In the version of the translator that you submit, please use the input file name "test.txt" (this will make it easier for us to grade).

The above translator works fine for text files but what if we want to translate a web page? A web page is written in HTML and contains some special HTML tags and escape sequences along with content text. We only want to translate the content of the web page leaving the HTML tags and web page formatting unchanged.

HTML tags look like <a href="index.html">; that is, a tag starts with a left angle bracket "<" and ends with a right angle bracket ">". In between there may be anything but a right angle bracket ">".

Escape sequences are used to represent symbols that cannot be written verbatim in HTML. For example, since "<" is a special HTML symbol, to enter the sign "<" in an HTML document, one uses the escape sequence &lt;. Other escape sequences are &nbsp;, &amp;, ... Every escape sequence starts with the symbol "&" and ends with the symbol ";".

So, your task is to translate everything except the text enclosed in < ... > or &...;. We do not require you to do this translation using the substitution operator only. You will probably find regular expressions helpful, but you may use any other Perl language constructs (if-else statements, arrays, the split operator) that you find suitable for the task.
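One possible way to separate content from markup is sketched below; it is not the only approach and not necessarily the intended one. It relies on the fact that split returns the delimiters as well when the pattern contains a capturing group. The names translate_html and $translate are hypothetical, and uc stands in for a real Pig Latin translator so that the effect is easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the code reference $translate to every piece
# of text that is neither an HTML tag nor an escape sequence.
sub translate_html {
    my ($html, $translate) = @_;

    # Because of the capturing group, the tags and escape sequences are
    # returned as list elements between the pieces of content text.
    my @parts = split /(<[^>]*>|&[^;]*;)/, $html;

    for my $part (@parts) {
        # Leave tags and escape sequences exactly as they are.
        next if $part =~ /\A(?:<[^>]*>|&[^;]*;)\z/;
        $part = $translate->($part);    # $part aliases the array element
    }
    return join "", @parts;
}

# Stand-in "translator": upper-case the content so the effect is visible.
print translate_html('<b>some</b> &amp; ant', sub { uc $_[0] }), "\n";
# prints: <b>SOME</b> &amp; ANT
```

Like the rules it follows, the sketch treats any "&" up to the next ";" as an escape sequence, so a stray ampersand in ordinary text would shield the words after it from translation.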

Please call the second translator webtranslator.pl. It should read from a web page (the source code for this page is good) and write its output to the file pig.html. The resulting page should be a valid HTML file and keep all formatting of the original page intact.

The rules that I described above are a bit simplistic. There are other parts of web pages that should be ignored, for example, JavaScript code. For the sake of simplicity, we do not require you to handle those parts properly. This means that your translator might not work correctly with every web page on the Internet. It should work fine with the majority of pages, though. In particular, it should work quite well with this page (and this should be enough for testing).

Input and output file naming. Please, in the version that you submit, read from the file pa3.htm and write to the file pig.html.

Files you will need

You do not need to download any files for this assignment. To save this page to a file (to be used for testing your webtranslator), use the Save As menu command of your web browser. You may check what the translated page should look like here.

You may find the source code for the Ubby-Dubby translator useful.

Files to Submit

You should submit two separate files, translator.pl and webtranslator.pl (one for each part of the assignment). If there is anything unusual about your implementation, you may submit the file Readme.txt with explanations. (Readme.txt has to be a plain text file.) No other documentation is necessary. However, having comments in your code might help the grader fix it and thus give you partial credit in case something does not work right.

Testing

For the first part, I suggest using some random text about a page long. You may want to make up some separate words with different capitalization and with 'y' in different places, since these are likely to cause problems.

For the second part, testing your script on this page should be enough.

Hints

In your substitution regular expressions, you will likely be using the character classes [aioueAIOUE] and [bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ] quite often. It makes sense to declare the variables

    $v = "[aioueAIOUE]";

    $c = "[bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]";

and then use these shorter symbols inside your regular expressions.
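The following sketch shows how such variables are interpolated into a regular expression. The sample text and the array names are made up for the illustration, and the patterns only find words; they do not translate anything.

```perl
use strict;
use warnings;

my $v = "[aioueAIOUE]";
my $c = "[bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]";

my $text = "great ant some";

# $v and $c are expanded to their character classes before matching.
my @vowel_initial   = $text =~ /\b($v\w*)/g;     # words starting with a vowel
my @cluster_initial = $text =~ /\b($c$c\w*)/g;   # words starting with two consonants

print "@vowel_initial\n";     # prints: ant
print "@cluster_initial\n";   # prints: great
```

If a variable is immediately followed by "[" or "{" inside a pattern, writing it as ${v} or ${c} keeps Perl from reading the bracket as an array or hash subscript.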

The order of the substitution operators matters! They are applied sequentially, one after another, so you will need to make sure that the result of an earlier substitution does not match any of the later ones. Otherwise, you will be applying transformations to the same word several times!
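The effect of ordering can be seen with two deliberately simplified toy rules. They are assumptions made for this illustration and not a solution: the consonant rule ignores "y" and mishandles words without vowels. The subroutine names are hypothetical.

```perl
use strict;
use warnings;

my $v = "[aioueAIOUE]";
my $c = "[bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]";

# Toy rule for vowel-initial words: append "way".
sub vowel_rule {
    my ($t) = @_;
    $t =~ s/\b($v\w*)\b/$1way/g;
    return $t;
}

# Toy rule for consonant-initial words: move leading consonants, add "ay".
sub consonant_rule {
    my ($t) = @_;
    $t =~ s/\b($c+)(\w+)\b/$2$1ay/g;
    return $t;
}

# Consonant rule first: "some" becomes "omesay", which now starts with a
# vowel, so the vowel rule fires on the same word a second time.
my $bad = vowel_rule(consonant_rule("some ant"));

# Vowel rule first: "antway" still starts with a vowel, so the consonant
# rule leaves it alone.
my $good = consonant_rule(vowel_rule("some ant"));

print "$bad\n";    # prints: omesayway antway
print "$good\n";   # prints: omesay antway
```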

Last updated on 3/6/2002 3:32:45 PM