COMP 225
Programming Assignment 3
Pig Latin Translator
Due on Monday, March 11, at 11:45 pm.
Goals
- Learn how to use regular expressions and substitution.
- Learn how to "translate" a web page.
Overview
In lecture, we discussed the Ubby-Dubby "language" and how to use the
Perl substitution operator to translate from English to Ubby-Dubby. A similar
"language" is Pig Latin. Various Pig Latin dialects exist. We
shall use the following simple rules.
- If the first letter of the word is a vowel ("a", "e", "u", "i", or "o"), then append "way" to the end of the word.
- Otherwise:
  - Move all letters before the first usable vowel to the end of the word. A usable vowel is:
    - any "a", "e", "u", "i", or "o"
    - any "y" that is not the first character of the word
  - Add "ay" to the end of the word.
(These rules are not perfect, but they are simple.)
Here are some examples:
English | Pig Latin
some | omesay
ant | antway
jdk | jdkay
great | eatgray
yacht | achtyay
kybosh | yboshkay
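To make the rules and the examples in the table concrete, here is a minimal sketch that applies the rules to one word at a time. It assumes the word consists only of lower-case letters a-z, and the helper name to_pig is hypothetical. It uses a plain pattern match to illustrate the rules, not the substitution-only approach that the first translator asks for.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): apply the rules
# to a single word made only of lower-case letters a-z.
sub to_pig {
    my ($word) = @_;

    # Rule 1: a leading vowel means we only append "way".
    return $word . "way" if $word =~ /^[aeiou]/;

    # Rule 2: the first letter is never a usable vowel here (leading
    # vowels were handled above, and a leading "y" does not count), so
    # the head is that letter plus any following consonants other
    # than "y". Everything after the head moves to the front.
    if ($word =~ /^([a-z][bcdfghjklmnpqrstvwxz]*)([a-z]*)$/) {
        return $2 . $1 . "ay";
    }

    # Anything else (empty string, digits, upper case) is left alone.
    return $word;
}

for my $word (qw(some ant jdk great yacht kybosh)) {
    print $word, " -> ", to_pig($word), "\n";
}
```

Run on the six words from the table, this sketch reproduces the Pig Latin column. Words containing upper-case letters are returned unchanged, since case handling is treated separately in this assignment.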
Your Task
We ask you to write two English-to-Pig Latin translators.
- The first translator (please call it translator.pl) should translate plain
text files. Every word in the input file should be translated according to the
rules described above, and the output should be printed to standard output.
We ask you to do the translation with a series of substitution operators,
similar to the way I wrote the Ubby-Dubby translator. You must use
substitution operators and nothing else to receive full credit for this part.
I managed with just two substitution operators (plus the two capitalization
operators described below).
Also, you should handle words with upper-case letters correctly. See below for
hints on how to do that.
Capitalization. Handling capitalization together with translation
would be quite tricky. I suggest that you instead handle capitalization
as a post-processing step. First, transform English words to
Pig Latin without changing the case of any characters. After the
English-to-Pig Latin translation, apply the following two substitution
operators to the result:
s/\b([a-z]+)([A-Z])([a-z]+)\b/\u$1\l$2$3/g;
s/\b([A-Z]{2,})([a-z]{2,3})\b/\U$1$2\E/g;
The first operator takes a word with a single upper-case letter in
the middle and moves the capitalization to the first letter of the word.
For example, the English word They would first be transformed
by one of your substitution operators into eyThay and then, by this
substitution, into Eythay.
The second operator converts words that consist of several upper-case
letters followed by 2 or 3 lower-case letters to all upper case. For
example, USA would first become USAway,
which this substitution would change to USAWAY.
This should take care of about 99.9% of the cases, and that is good enough.
Of course, you may use any other method if you prefer. But as the end
result, you should:
- Transform a word in which only the first letter is upper case into another
word in which only the first letter is upper case. For example, That
should be translated into Atthay, not atThay.
- Transform all-caps words into all-caps words. For example, JDK should
become JDKAY, not JDKay.
- Handle other cases, like McDonald, in any way you like.
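As a quick check of the post-processing step, the sketch below wraps the two capitalization operators quoted above in a helper and runs them on already-translated words. The helper name fix_case and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two capitalization operators from the
# assignment text to a string that is already in Pig Latin.
sub fix_case {
    my ($text) = @_;

    # eyThay -> Eythay: move the capital letter to the front of the word.
    $text =~ s/\b([a-z]+)([A-Z])([a-z]+)\b/\u$1\l$2$3/g;

    # USAway -> USAWAY: upper-case the short lower-case suffix as well.
    $text =~ s/\b([A-Z]{2,})([a-z]{2,3})\b/\U$1$2\E/g;

    return $text;
}

print fix_case("eyThay atThay USAway JDKay"), "\n";
# prints: Eythay Atthay USAWAY JDKAY
```

Words that are entirely lower case match neither pattern and pass through unchanged.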
Specifying the input file name. Ideally, it should be a command-line
argument of your script. But that would make running your script from UltraEdit
somewhat more awkward, so you may set the file name inside your script. In the
version of the translator that you submit, please use the input file name "test.txt"
(this will make it easier for us to grade).
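One way to reconcile the two options described above is to accept an optional command-line argument and fall back to "test.txt" when none is given. This is only a sketch under that assumption; the helper name input_file_name is hypothetical, and the loop simply echoes each line where a real translator would translate it.

```perl
use strict;
use warnings;

# Hypothetical helper: use the first command-line argument if there is
# one, otherwise the default file name requested for submission.
sub input_file_name {
    my @args = @_;
    return @args ? $args[0] : "test.txt";
}

my $file = input_file_name(@ARGV);

if (open(my $in, '<', $file)) {
    while (my $line = <$in>) {
        print $line;    # a real translator would translate $line first
    }
    close($in);
} else {
    warn "Cannot open $file: $!\n";
}
```

Started without arguments (for example from an editor), this version reads test.txt.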
- The above translator works fine for text files, but what if we want to
translate a web page? A web page is written in HTML and contains special HTML
tags and escape sequences along with the content text. We only want to
translate the content of the web page, leaving the HTML tags and web page
formatting unchanged.
An HTML tag looks like <a href="index.html">, i.e. it
starts with a left angle bracket "<" and ends with a
right angle bracket ">".
In between there may be anything but a right angle bracket ">".
Escape sequences are used to represent certain symbols that cannot be
written verbatim in HTML. For example, since "<" is a
special HTML symbol, to enter the sign "<" in an HTML document,
one uses the escape sequence &lt;. Other escape sequences are &nbsp;,
&amp;, ... Every escape sequence starts with the symbol "&"
and ends with the symbol ";".
So, your task is to translate everything except the text enclosed in <
... > or &...;. We do not require you to do this translation using
the substitution operator only. You will probably find regular expressions
helpful, but you may use any other Perl language constructs (if-else
statements, arrays, the split operator) that you find suitable for the task.
Please call the second translator webtranslator.pl.
It should read from a web page (the source code for this page is good) and
write its output to the file pig.html.
The resulting page should be a valid HTML file and keep all formatting of the
original page intact.
The rules that I described above are a bit simplistic. There are some other
parts of web pages that should be ignored, for example, JavaScript code. For
the sake of simplicity, we do not require you to handle those things properly.
This means that your translator might not work right with every web page on the
Internet. It should work fine with the majority of the pages though. In
particular, it should work quite well with this page (and this should be enough
for testing).
Input and output file naming. In the version that you submit, please
read from the file pa3.htm and write to the file pig.html.
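The tag and escape-sequence rules for the second translator can be sketched with the split operator. The example below is one possible structure, not a required one. The names translate_text and translate_html are hypothetical, and translate_text is a placeholder that only upper-cases words so that the effect is visible; a real solution would apply the Pig Latin rules there.

```perl
use strict;
use warnings;

# Placeholder word translator (an assumption for this sketch): it only
# upper-cases each word. A real solution would apply Pig Latin here.
sub translate_text {
    my ($text) = @_;
    $text =~ s/([A-Za-z]+)/\U$1\E/g;
    return $text;
}

# Translate only the text outside <...> tags and &...; escape sequences.
sub translate_html {
    my ($html) = @_;

    # The capturing parentheses make split return the delimiters too,
    # so plain text sits at even indices and tags or escape sequences
    # sit at odd indices. The pattern follows the simplified rule in
    # the text: "&", then anything but ";", then ";".
    my @pieces = split /(<[^>]*>|&[^;]*;)/, $html;

    for my $i (0 .. $#pieces) {
        $pieces[$i] = translate_text($pieces[$i]) if $i % 2 == 0;
    }
    return join '', @pieces;
}

print translate_html('<a href="index.html">some text</a> &lt; more'), "\n";
# prints: <a href="index.html">SOME TEXT</a> &lt; MORE

# File handling with the names requested above; skipped when pa3.htm
# is not present in the current directory.
if (open(my $in, '<', 'pa3.htm')) {
    local $/;                        # slurp mode: read the whole file
    my $html = <$in>;
    $html = '' unless defined $html;
    close($in);
    open(my $out, '>', 'pig.html') or die "Cannot write pig.html: $!\n";
    print {$out} translate_html($html);
    close($out);
}
```

As with the simplified rules themselves, this sketch does not treat script content specially, and a stray "&" in ordinary text would be read as the start of an escape sequence.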
Files you will need
You do not need to download any files for this assignment. To save this page
to a file (to be used for testing your webtranslator), use the Save As
menu command of your web browser. You can see what the translated page should
look like here.
You may find the
source code for the Ubby-Dubby translator useful.
Files to Submit
You should submit two separate files, translator.pl and webtranslator.pl
(one for each part of the assignment). If there is anything unusual about your
implementation, you may submit the file Readme.txt with explanations. (Readme.txt
has to be a plain text file.) No other documentation is necessary. However,
having comments in your code might help the grader to fix it and thus give you
partial credit in case something does not work right.
Testing
For the first part, I suggest using some random text about a page
long. You may want to make up some separate words with different capitalization
and with 'y' in different places, since these are likely to cause problems.
For the second part, testing your script on
this page should be enough.
Hints
- In your substitution regular expressions, you will likely use the
character classes [aioueAIOUE] and [bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]
quite often. It makes sense to declare the variables
$v = "[aioueAIOUE]";
$c = "[bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]";
and then use these shorter symbols inside your regular expressions.
- The order of the substitution operators matters! They are applied
sequentially, one after another, so you will need to make sure that the
result of an earlier substitution does not match any of the later ones.
Otherwise, you will be applying transformations to the same word several times!
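Both hints can be seen in a small experiment. The sketch below interpolates $v and $c into two deliberately simplified rules (they work on a single word and ignore "y") and applies them in both orders. The helper names vowel_rule and consonant_rule are hypothetical, and these are not the substitutions a full solution needs.

```perl
use strict;
use warnings;

my $v = "[aioueAIOUE]";
my $c = "[bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]";

# Simplified rule for words that start with a vowel: append "way".
sub vowel_rule {
    my ($w) = @_;
    $w =~ s/^($v\w*)$/${1}way/;
    return $w;
}

# Simplified rule for words that start with consonants: move them to
# the end and append "ay". It ignores "y" to keep the sketch short.
sub consonant_rule {
    my ($w) = @_;
    $w =~ s/^($c+)($v\w*)$/${2}${1}ay/;
    return $w;
}

# Consonant rule first: its output starts with a vowel, so the vowel
# rule matches again and the word is translated twice.
print vowel_rule(consonant_rule("some")), "\n";   # prints: omesayway

# Vowel rule first: its output never starts with a consonant, so the
# consonant rule cannot fire on an already translated word.
print consonant_rule(vowel_rule("some")), "\n";   # prints: omesay
print consonant_rule(vowel_rule("ant")), "\n";    # prints: antway
```

The same reasoning applies when the substitutions run over a whole line with the /g modifier: each later operator sees the output of the earlier ones.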
Last updated on 3/6/2002 3:32:45 PM