# Week 5

In [2]:
import nltk  # make sure NLTK is installed and loaded

## Word Lists

Use the `nltk.corpus.words` wordlist to estimate the following for several text corpora.

In [4]:
nltk.download('words')
nltk.download('gutenberg')
from nltk.corpus import words
from nltk.corpus import gutenberg 

[nltk_data] Downloading package words to /home/goodmami/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


- Choose a variety of texts from the Gutenberg corpus. What percentage of the texts' vocabularies are not in the wordlist?

In [6]:
gutenberg.fileids()  # inspect available files

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [8]:
files = ['austen-emma.txt', 'milton-paradise.txt', 'shakespeare-macbeth.txt']  # choose some

In [10]:
WORDS = set(words.words())  # words.words() is a list; make it a set to speed up lookup
for file in files:
    # Get the list of words in the text and normalize early.
    # The isalpha() filter removes tokens like ",", but also valid words with punctuation.
    # How you normalize and filter is up to you.
    textwords = [w.lower() for w in gutenberg.words(file) if w.isalpha()]
    # Words not in the set of "known" words are often called OOV (out-of-vocabulary)
    oov = [w for w in textwords if w not in WORDS]
    print(file)
    print('  sample:', oov[:10])
    print('  percent of tokens that are unknown:', 100 * len(oov) / len(textwords))
    print('  percent of types that are unknown:', 100 * len(set(oov)) / len(set(textwords)))

austen-emma.txt
  sample: ['austen', 'seemed', 'blessings', 'years', 'youngest', 'daughters', 'died', 'caresses', 'supplied', 'years']
  percent of tokens that are unknown: 8.593440594059405
  percent of types that are unknown: 28.323209492866223
milton-paradise.txt
  sample: ['john', 'milton', 'eden', 'oreb', 'sinai', 'siloa', 'flowed', 'intends', 'aonian', 'pursues']
  percent of tokens that are unknown: 10.301516902090865
  percent of types that are unknown: 34.33983286908078
shakespeare-macbeth.txt
  sample: ['tragedie', 'macbeth', 'william', 'shakespeare', 'actus', 'scoena', 'witches', 'againe', 'raine', 'burley']
  percent of tokens that are unknown: 23.664623467600702
  percent of types that are unknown: 51.05889178996229
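
The type percentage is much higher than the token percentage for every text. A minimal sketch of why, using made-up toy data rather than a Gutenberg text: frequent words tend to be known and account for many tokens, while each rare unknown word counts as a full type. The helper name `oov_rates` and the toy lists are assumptions for illustration only.

```python
def oov_rates(tokens, vocabulary):
    """Return (token_pct, type_pct): the share of tokens and of distinct types not in vocabulary."""
    oov_tokens = [t for t in tokens if t not in vocabulary]
    token_pct = 100 * len(oov_tokens) / len(tokens)
    type_pct = 100 * len(set(oov_tokens)) / len(set(tokens))
    return token_pct, type_pct

# Toy data, invented for this example
toy_vocab = {'the', 'cat', 'sat', 'on', 'mat'}
toy_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cats', 'the', 'mat']

token_pct, type_pct = oov_rates(toy_tokens, toy_vocab)
print('token OOV %:', token_pct)  # 1 unknown token out of 10
print('type OOV %:', type_pct)    # 1 unknown type out of 6
```

The single unknown word is 10% of the tokens but about 17% of the types, the same direction of skew as in the results above.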


- What percentage of the wordlist is present in each text?

In [12]:
for file in files:
    # Get the list of words, as before.
    textwords = [w.lower() for w in gutenberg.words(file) if w.isalpha()]
    # IV = in-vocabulary, by analogy to OOV
    iv = [w for w in textwords if w in WORDS]
    print(file)
    print('  percent of wordlist present in text:', 100 * len(set(iv)) / len(WORDS))


austen-emma.txt
  percent of wordlist present in text: 2.150984348769776
milton-paradise.txt
  percent of wordlist present in text: 2.498177131907822
shakespeare-macbeth.txt
  percent of wordlist present in text: 0.7151577840706764


## CMU Pronouncing Dictionary

Use the ARPABET transcriptions in the `nltk.corpus.cmudict` corpus to investigate sound patterns.

In [14]:
nltk.download('cmudict')
from nltk.corpus import cmudict

[nltk_data] Downloading package cmudict to /home/goodmami/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


In [16]:
cmu = cmudict.dict()
cmu['pronounce']

[['P', 'R', 'AH0', 'N', 'AW1', 'N', 'S']]

- Pick some minimal pairs and look at vowel differences (e.g., *pick* / *pack* / *peck* / *peak*)

In [18]:
for word in ('pick', 'pack', 'peck', 'peak'):
    print(word, '->', cmu[word])

pick -> [['P', 'IH1', 'K']]
pack -> [['P', 'AE1', 'K']]
peck -> [['P', 'EH1', 'K']]
peak -> [['P', 'IY1', 'K']]


- Devise a function for identifying rhyming words (how they are identified is up to you)

In [20]:
def rhymes(word1, word2):
    pron_list1 = cmu[word1]
    pron_list2 = cmu[word2]
    return any(p1[-2:] == p2[-2:]  # are the last 2 phonemes enough? are they too much?
               for p1 in pron_list1
               for p2 in pron_list2)

print('rhymes with "pack":')
for other in ('pick', 'peck', 'peak', 'back', 'track'):
    print('  ', other, rhymes('pack', other))

rhymes with "pack":
   pick False
   peck False
   peak False
   back True
   track True


- (Extra) Why doesn't something like "smokestack" or "quarterback" rhyme with "pack" according to the function above?

In [22]:
for word in ('pack', 'smokestack', 'quarterback'):
    print(word, cmu[word])

pack [['P', 'AE1', 'K']]
smokestack [['S', 'M', 'OW1', 'K', 'S', 'T', 'AE2', 'K']]
quarterback [['K', 'W', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K'], ['K', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K']]
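
The final vowel of "smokestack" and "quarterback" carries secondary stress (`AE2`), while "pack" has primary stress (`AE1`), so the exact comparison of the last two phonemes fails. One possible fix, sketched here under the assumption that stress should be ignored for rhyming, is to strip the stress digits before comparing. The pronunciations are copied from the output above so the example runs without `cmudict`; `strip_stress` and `rhymes_ignoring_stress` are hypothetical helper names.

```python
# Pronunciations copied from the cmudict output shown above
PRONS = {
    'pack': [['P', 'AE1', 'K']],
    'pick': [['P', 'IH1', 'K']],
    'smokestack': [['S', 'M', 'OW1', 'K', 'S', 'T', 'AE2', 'K']],
    'quarterback': [
        ['K', 'W', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K'],
        ['K', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K'],
    ],
}

def strip_stress(pron):
    """Drop the stress digit from vowels: 'AE1' and 'AE2' both become 'AE'."""
    return [phone.rstrip('012') for phone in pron]

def rhymes_ignoring_stress(word1, word2, prons=PRONS):
    """True if any pair of pronunciations shares its last two phonemes, ignoring stress."""
    return any(strip_stress(p1)[-2:] == strip_stress(p2)[-2:]
               for p1 in prons[word1]
               for p2 in prons[word2])

for other in ('pick', 'smokestack', 'quarterback'):
    print(other, rhymes_ignoring_stress('pack', other))
```

Whether stress should matter is a design choice: ignoring it accepts more pairs, some of which may not sound like good rhymes.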


- What are the largest clusters of rhyming words?

In [24]:
# first create a structure mapping each rhyming scheme to the list of words ending in the scheme
clusters = {}
for word, pron_list in cmu.items():
    for pron in pron_list:
        scheme = tuple(pron[-2:])  # make it a tuple so it can be a dictionary key
        # initialize an empty list if we haven't seen the rhyming scheme before
        if scheme not in clusters:
            clusters[scheme] = []
        clusters[scheme].append(word)

# To find the largest cluster, we could go through each and keep track of the largest we've seen:
max_scheme = None
max_value = 0
for scheme, cluster in clusters.items():
    if len(cluster) > max_value:
        max_scheme = scheme
        max_value = len(cluster)
print('largest cluster')
print('  scheme:', max_scheme)
print('  size:', len(clusters[max_scheme]))

# Alternatively, use the max() function with a "lambda" expression (like an inline function)
max_scheme = max(clusters, key=lambda scheme: len(clusters[scheme]))
print('largest cluster')
print('  scheme:', max_scheme)
print('  size:', len(clusters[max_scheme]))

print('sample:', clusters[max_scheme][:10])

largest cluster
  scheme: ('AH0', 'N')
  size: 9100
largest cluster
  scheme: ('AH0', 'N')
  size: 9100
sample: ['aachen', 'aaron', 'aaronson', 'aaronson', 'aasen', 'abandon', 'abbreviation', 'abdication', 'abdomen', 'abdomen']
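
The sample contains 'aaronson' and 'abdomen' twice: a word with several pronunciations that end in the same scheme is appended once per pronunciation, which inflates the cluster size. A sketch of one way to avoid this is to collect words in sets via `collections.defaultdict`. The toy dictionary reuses pronunciations shown earlier in this notebook so the example runs without `cmudict`; `rhyme_clusters` and the `toy_` names are hypothetical.

```python
from collections import defaultdict

# A small stand-in for cmudict.dict(), using pronunciations shown earlier
toy_cmu = {
    'pick': [['P', 'IH1', 'K']],
    'pack': [['P', 'AE1', 'K']],
    'peck': [['P', 'EH1', 'K']],
    'peak': [['P', 'IY1', 'K']],
    'smokestack': [['S', 'M', 'OW1', 'K', 'S', 'T', 'AE2', 'K']],
    'quarterback': [
        ['K', 'W', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K'],
        ['K', 'AO1', 'R', 'T', 'ER0', 'B', 'AE2', 'K'],
    ],
}

def rhyme_clusters(pron_dict, n=2):
    """Map each scheme (the last n phonemes) to the set of words ending in it."""
    clusters = defaultdict(set)  # missing keys start out as an empty set
    for word, pron_list in pron_dict.items():
        for pron in pron_list:
            clusters[tuple(pron[-n:])].add(word)  # a set ignores repeated words
    return clusters

toy_clusters = rhyme_clusters(toy_cmu)
toy_largest = max(toy_clusters, key=lambda scheme: len(toy_clusters[scheme]))
print(toy_largest, sorted(toy_clusters[toy_largest]))
```

With lists, 'quarterback' would be counted twice in the `('AE2', 'K')` cluster; with sets it is counted once.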


## WordNet

Use `nltk.corpus.wordnet` to look at word relations.

In [26]:
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /home/goodmami/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


- What are the synsets of *student*?

In [28]:
wn.synsets('student')

[Synset('student.n.01'), Synset('scholar.n.01')]

- What is the definition of each synset of student?

In [30]:
for synset in wn.synsets('student'):
    print(synset, synset.definition())

Synset('student.n.01') a learner who is enrolled in an educational institution
Synset('scholar.n.01') a learned person (especially in the humanities); someone who by long study has gained mastery in one or more disciplines


- What are the **hyponyms** of each synset of *student*?

In [32]:
for synset in wn.synsets('student'):
    print(synset, synset.hyponyms())
    print()  # a blank line in between helps; these are long lists

Synset('student.n.01') [Synset('art_student.n.01'), Synset('auditor.n.02'), Synset('catechumen.n.01'), Synset('collegian.n.01'), Synset('crammer.n.01'), Synset('etonian.n.01'), Synset('ivy_leaguer.n.01'), Synset('law_student.n.01'), Synset('major.n.03'), Synset('medical_student.n.01'), Synset('nonreader.n.01'), Synset('overachiever.n.01'), Synset('passer.n.03'), Synset('scholar.n.03'), Synset('seminarian.n.01'), Synset('sixth-former.n.01'), Synset('skipper.n.01'), Synset('underachiever.n.01'), Synset('withdrawer.n.05'), Synset('wykehamist.n.01')]

Synset('scholar.n.01') [Synset('academician.n.02'), Synset('alumnus.n.01'), Synset('arabist.n.01'), Synset('bibliographer.n.01'), Synset('bibliophile.n.01'), Synset('cabalist.n.03'), Synset('doctor.n.04'), Synset('goliard.n.01'), Synset('historian.n.01'), Synset('humanist.n.01'), Synset('initiate.n.02'), Synset('islamist.n.01'), Synset('licentiate.n.01'), Synset('masorete.n.01'), Synset('master.n.08'), Synset('mujtihad.n.01'), Synset('musicol

- How many synsets are available for each of *professor*, *lecturer*, *instructor*, and *teacher*?

In [34]:
words = ('professor', 'lecturer', 'instructor', 'teacher')  # note: this rebinds `words`, shadowing the nltk.corpus.words import from earlier
for word in words:
    print(word, len(wn.synsets(word)))

professor 1
lecturer 2
instructor 1
teacher 2


- Are there any overlapping synsets among them?

In [36]:
# the direction of the pairing doesn't matter (('professor', 'lecturer') is the same as
# ('lecturer', 'professor')), so a slice in the second for-loop avoids repeating pairs
for i, word1 in enumerate(words):
    for word2 in words[i+1:]:
        print(word1, word2, set(wn.synsets(word1)).intersection(wn.synsets(word2)))

professor lecturer set()
professor instructor set()
professor teacher set()
lecturer instructor set()
lecturer teacher set()
instructor teacher {Synset('teacher.n.01')}


- Use the `lowest_common_hypernyms()` method on synsets to find the shared **hypernym** of *student* and *professor*. How about *professor* and *lecturer*?

In [38]:
# for this I'll pick 'student.n.01' and 'professor.n.01'
wn.synset('student.n.01').lowest_common_hypernyms(wn.synset('professor.n.01'))

[Synset('person.n.01')]

In [40]:
# for this I'll pick 'professor.n.01' and 'lector.n.02'
wn.synset('professor.n.01').lowest_common_hypernyms(wn.synset('lector.n.02'))

[Synset('educator.n.01')]

- The synsets retrieved from WordNet are generally sorted by the frequency of occurrence (bonus question: how would the "frequency of occurrence" be computed?). Write a function that tags each word in a sentence with the first synset returned by WordNet. Skip words that do not return any synsets.

In [42]:
# For bonus question: we need a corpus that has been annotated with word senses
# to determine frequency *for that corpus*. The creators of the wordnet can also
# order the synsets to encode general preference.
def sense_tag(sentence):
    words = sentence.split()
    pairs = []
    for word in words:
        synsets = wn.synsets(word)
        if synsets:
            pairs.append((word, synsets[0]))  # only the first synset
    return pairs

print(sense_tag('I bought a mouse for my laptop'))

[('I', Synset('iodine.n.01')), ('bought', Synset('buy.v.01')), ('a', Synset('angstrom.n.01')), ('mouse', Synset('mouse.n.01')), ('laptop', Synset('laptop.n.01'))]


Consider these sentences:
  - *The doctor is in, today.*
  - *The doctor is in the office, today.*
  - *The doctor's shoes are very in, this season.*

**Q:** With your synset tagger, do all sentences get the same synset for *in*? Which synsets should they get?

In [44]:
print(sense_tag('The doctor is in , today .'))  # spaces around punctuation because my tokenization is just split()
print(sense_tag('The doctor is in the office, today .'))  # note: 'office,' stays one token here, so it gets no synset
print(sense_tag("The doctor's shoes are very in , this season ."))

[('doctor', Synset('doctor.n.01')), ('is', Synset('be.v.01')), ('in', Synset('inch.n.01')), ('today', Synset('today.n.01'))]
[('doctor', Synset('doctor.n.01')), ('is', Synset('be.v.01')), ('in', Synset('inch.n.01')), ('today', Synset('today.n.01'))]
[('shoes', Synset('place.n.06')), ('are', Synset('are.n.01')), ('very', Synset('very.s.01')), ('in', Synset('inch.n.01')), ('season', Synset('season.n.01'))]
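
Putting spaces around punctuation by hand is error-prone. A minimal sketch of a regex tokenizer that separates punctuation from words, so sentences can be passed in as normally written. The name `simple_tokenize` is hypothetical, and the pattern is a deliberately crude assumption rather than a full tokenizer; note that it also splits "doctor's" into three tokens.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches letters, digits, and underscores; [^\w\s] matches one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize('The doctor is in the office, today.'))
print(simple_tokenize("The doctor's shoes"))
```

To use it, `sense_tag()` could call a tokenizer like this in place of `sentence.split()`.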


In [46]:
for synset in wn.synsets('in'):
    print(synset, synset.definition())

Synset('inch.n.01') a unit of length equal to one twelfth of a foot
Synset('indium.n.01') a rare soft silvery metallic element; occurs in small quantities in sphalerite
Synset('indiana.n.01') a state in midwestern United States
Synset('in.s.01') holding office
Synset('in.s.02') directed or bound inward
Synset('in.s.03') currently fashionable
Synset('in.r.01') to or toward the inside of


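They all get the same synset (`inch.n.01`), but it isn't correct for any of them. They should get:
- in.s.01
- None (here *in* is a preposition, and WordNet only covers nouns, verbs, adjectives, and adverbs)
- in.s.03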

**Q:** Can you think of ways we might improve the tagger to get better performance?

**A:** If we had the part-of-speech information for each word, we could restrict the lookup to synsets with that part of speech, which should select the correct synset more often. We could also try to build statistical models, for instance by looking at each word's left and right context to help decide, but this requires annotated data to train the model.
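
A sketch of the part-of-speech idea, assuming the POS of each word is already known. To stay runnable without WordNet data, it filters a toy sense inventory whose synset names for *in* are copied from the output above; `first_sense` and the tag `'p'` for prepositions are hypothetical. With NLTK itself, `wn.synsets()` accepts a `pos` argument that does a similar restriction.

```python
# Synset names for "in", copied from the output above, in WordNet's order
TOY_SENSES = {
    'in': ['inch.n.01', 'indium.n.01', 'indiana.n.01',
           'in.s.01', 'in.s.02', 'in.s.03', 'in.r.01'],
}

def first_sense(word, pos, senses=TOY_SENSES):
    """Return the first sense name whose POS letter matches pos, or None."""
    # Synset names mark "satellite" adjectives with 's'; count them as adjectives here
    wanted = {'a', 's'} if pos == 'a' else {pos}
    for name in senses.get(word, []):
        if name.split('.')[-2] in wanted:  # 'in.s.01' -> 's'
            return name
    return None

print(first_sense('in', 'a'))  # adjective reading
print(first_sense('in', 'p'))  # preposition: nothing to return
```

The filter fixes the first two sentences, since the adjective gets `in.s.01` and the preposition gets nothing, but the third sentence would still get `in.s.01` instead of `in.s.03`. Part of speech alone narrows the choice without fully resolving it.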

**Q:** How would we measure if the performance improves or degrades?

**A:** We need some **gold standard** annotations to compare against. Without those, we would have to manually determine, with our own intuition, whether an annotation is correct or not, and this has issues with consistency and scale.
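
With gold annotations in hand, the simplest measure is accuracy: the fraction of predictions that match the gold label. A minimal sketch, using the three expected labels for *in* from the answer above as a tiny gold standard. The `improved` predictions are hypothetical, invented only to show a comparison; a real evaluation would need far more annotated examples.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError('predicted and gold must have the same length')
    if not gold:
        raise ValueError('cannot compute accuracy on an empty gold standard')
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Gold labels for "in" in the three example sentences (None = no synset applies)
gold = ['in.s.01', None, 'in.s.03']

baseline = ['inch.n.01', 'inch.n.01', 'inch.n.01']  # what the first-synset tagger produced
improved = ['in.s.01', None, 'in.s.01']             # hypothetical output of a better tagger

print('baseline accuracy:', accuracy(baseline, gold))
print('improved accuracy:', accuracy(improved, gold))
```

Comparing the two numbers on the same gold data shows whether a change to the tagger helped or hurt.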