Jim's
Tutorials

Spring 2012
course
navigation

2012-04-04

Aside

Work for Plan Project: I wrote a short program that does a quick string comparison. Similar to Hamming Distance, in that it looks at how many differences there are in the string. Then it divides this number by max(len(str1),len(str2)).

Text Classification

General Questions:
How to do it?
i.e., what means "happy"/"sad" or what determines categorization?
Naïve Bayes Categorization:
P(d|c).P(c) P(c|d) = ----------- P(d)
c=class, d=document
(In the lecture, MAP in c_MAP = "Maximum A Posteriori Probability")
c_NB = argmax P(c_j) . Product P(x|c) c in C x in X
Maximum Likelihood Estimates (MLE)
What do we do with unknown words?
P(w_u) = 1/|all counts|+|V+1|
Close relationship to lang. modeling.
| correct not correct ------------|------------------------------------ selected | true pos false pos not selected| false neg true neg (true pos + true neg) Accuracy = --------------------- (tp+tn+fp+fn)
Useful measure? Usually. But not in cases where we have very few true positives.
Precision:
Recall:
There is a trade-off. How important is precision? How important is recall?
NLP systems: there is almost always this trade-off.
F-measure: combination of precision and recall.(Weighted harmonic mean) - very conservative avg.
F = 1/(alpha(1/P)+(1-alpha)(1/R)) = (beta^2 +1)PR/(beta^2 . P)+R F_1 = 2PR/(P+R) Example: Given that a classification system has precision and recall of P=0.75, R=0.25:
Averaging: Micro and macro:
We need Train|Dev|Test sets. To avoid overfitting.
Easiest way to automated split:
Multiple splits for dev set. Compute pooled devset performance. Then still use testset.
Underflow prevention. Multiplying lots of fractions: floating pt. underflow.

Sentiment Analysis

Products/polls/anything really.
Look at all words, not just adj. We can get information from nouns etc.
Boolean Multinomial Naïve Bayes:
  1. Delete duplicate words
  2. Bayes rule as usual
Only changes between no occurrence and an occurrence of a word affect the calculation.
Train 1 Computer Science Computer Jim CS --> Computer Science Jim C(Computer|CS) = 3 2 Natural Language Processing Computer CS --> Natural Language Processing Computer 3 Syntax Computer Language Programming CS --> Syntax Computer Language Programming 4 Syntax Semantics Grammar Syntax LIN --> Syntax Semantics Grammar C(Semantics|LIN) = 1 Test 5 Semantics Grammar Language Computer ? LIN
There are many sentiment lexicons. Inquirer, Linguistic Inquiry and Word Count (LIWC), MPQA Subjectivity Clues Lexicon, ...
SentiWordNet Pos,Neg,Obj
What about if we want to make our own lexicon? Hatzivassiloglou and McKeown
Turney Algorithm
  1. Extract a Phrasal lexicon from reviews
  2. Learn polarity of each phrase
  3. Rate a review by the average polarity of its phrases
(from slides)
Extract two-word phrases with adjectives, with certain restrictions:
Intuition: Positive more often with "excellent", negative more often with "poor".
How to measure co-occurrence (how often they appear together)?
Mutual Information!

Pointwise Mutual Information (how much more do often words co-occur than if independent?):

The "words" can also be phrases, and the calculation is done in the same way. The probability of the individual N-grams in the phrase isn't important... it's the phrase itself.



Turney was shown to be 74% accurate.
It learns "domain specific information" (aka. "Field of Knowledge" information). Intuition:
Other sentiment tasks:
In-lecture problem: "Suppose we've found that "action scenes" is among the most frequent phrases in a set of movie reviews. Finding it in which of the following contexts would be most likely to lead us to select "action scenes" as a salient aspect/attribute/target for a review?"
Sometimes, the "aspect" isn't actually mentioned, rather a defining aspect of the aspect is reviewed.
How do we deal with that? Take a small corpus of restaurant reviews and label each sentence. It shouldn't take too long for "reasonably-sized" (depending on hardware) corpora.
Instead of finding the aspect from frequently-occurring phrases, we can determine it in advance (a priori). We then tell the classifier what we're looking for and find, for example, what the reviews about the food or décor at Bob's Restaurant were like.
Blair-Goldensohn algorithm:
Text Sentences Sentiment Aspect Reviews -> extractor -> and -> Classifier -> S&P -> Extractor -> S&P -> Aggregator -> Final Phrases
The baseline algo. assumes that classes (+/-) have equal frequencies. For toy examples, this is a valid assumption. But not always the case for the "real world."
We can use F-scores if the data is unbalanced.
Common solutions:

Assignment 3

Sentiment analysis: Classification of movie reviews into "positive" and "negative." I've started looking at/playing around with it. It draws on many things we've seen in the AI class, as well as some of the "information" theory stuff from the Information Theory class. It's looking at the whole review. So, I'll probably do a count of "polarity words" and their co-occurrences (maybe 2-/3-grams) and look at probabilities that way.
Another thing the assignment mentions is "stop word filtering." Words like "the," "and," "it" ... I'm interested to see what effect it has on the system. I guess it makes it easier to get "good" data. We'll see.
Pseudocode that I should follow for the algorithm (from Information Retrieval by Manning, Raghavan, Schütze):

Spelling Correction Assignment (week 2)

Completed and given files uploaded to Spelling Correction
http://cs.marlboro.edu/ courses/ spring2012/jims_tutorials/ elias/ 2012-04-04
last modified Friday April 6 2012 2:15 pm EDT

attachments [paper clip]

     name last modified size
   compare.py Apr 6 2012 2:14 pm 1.38kB [IMG]pseudocode.png Apr 6 2012 11:28 am 66.7kB