# Week 8

Regular expressions, stemming, lemmatization, and segmentation.

# Regex Basics

First import the `re` library:

In [1]:
import re

Then use `re.match()` to match the following strings:

In [2]:
re.match(r'a', 'a') # a single 'a'

<re.Match object; span=(0, 1), match='a'>

In [3]:
re.match(r'a+', 'aaaaaaaa')  # multiple 'a's

<re.Match object; span=(0, 8), match='aaaaaaaa'>

In [4]:
re.match(r'[a-z]*', 'abcdefghijklmnopqrstuvwxyz')  # any letter

<re.Match object; span=(0, 26), match='abcdefghijklmnopqrstuvwxyz'>

In [5]:
re.match(r'\w*', 'abcdefghijklmnopqrstuvwxyz')  # any letter (alternative; \w also matches digits and '_')

<re.Match object; span=(0, 26), match='abcdefghijklmnopqrstuvwxyz'>

In [6]:
re.match(r'[A-Ca-c]*', 'AaBbCc')  # upper and lower-case letters

<re.Match object; span=(0, 6), match='AaBbCc'>

In [7]:
re.match(r'\w+', 'Aあ啊')  # all word characters (hint, use a character class)

<re.Match object; span=(0, 3), match='Aあ啊'>
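One caveat about the patterns above: `re.match()` anchors only at the start of the string, and `*` accepts zero repetitions, so a pattern can succeed with an empty or partial match. The sketch below contrasts `re.match()` with `re.fullmatch()`, which requires the whole string to match. The test strings and variable names are made up for illustration.

```python
import re

# '*' allows zero repetitions, so this matches the empty string at position 0
empty = re.match(r'[a-z]*', 'ABC')
print(empty.span())

# re.match() anchors only at the start, so matching a prefix is enough
prefix = re.match(r'[a-z]+', 'abc123')
print(prefix.group())

# re.fullmatch() succeeds only if the entire string matches the pattern
partial = re.fullmatch(r'[a-z]+', 'abc123')
whole = re.fullmatch(r'[a-z]+', 'abc')
print(partial, whole.group())
```

When checking that a pattern covers a whole string, comparing the reported `span` against the string length (as in the outputs above) or switching to `re.fullmatch()` avoids being misled by a prefix match.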

## Searching

Now use `re.search()`, or `re.finditer()` to collect every match, to find a match while ignoring false matches:

In [8]:
# match the articles 'a', 'an', and 'the' as whole words, but not other 'a's
for s in ('Apples are a fruit', 'An apple each day', 'A pear is not an apple'):
    print(list(re.finditer(r'\b([Aa]n?|[Tt]he)\b', s)))

[<re.Match object; span=(11, 12), match='a'>]
[<re.Match object; span=(0, 2), match='An'>]
[<re.Match object; span=(0, 1), match='A'>, <re.Match object; span=(14, 16), match='an'>]
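The false matches are avoided by the `\b` word-boundary assertions. As a minimal sketch of their effect, the example below runs a simplified article pattern with and without `\b` on one of the sentences above; the variable names are illustrative only.

```python
import re

sentence = 'A pear is not an apple'

# Without \b, the pattern also fires inside 'pear' and 'apple'
loose = re.findall(r'[Aa]n?', sentence)

# With \b on both sides, only whole-word articles survive
strict = re.findall(r'\b[Aa]n?\b', sentence)

# re.search() returns only the first match (or None if there is none)
first = re.search(r'\b[Aa]n?\b', sentence)

print(loose)         # ['A', 'a', 'an', 'a']
print(strict)        # ['A', 'an']
print(first.span())  # (0, 1)
```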


Use `re.findall()` over the text of Jane Austen's *Sense and Sensibility* to find hyphenated words:

In [9]:
import nltk
sense = nltk.corpus.gutenberg.raw('austen-sense.txt')

In [10]:
set(re.findall(r'\w+(?:-\w+)+', sense))

{'Bond-street',
 'Cold-hearted',
 'Good-by',
 'Hanover-square',
 'High-church',
 'Mansion-house',
 'Mid-summer',
 'To-morrow',
 'a-day',
 'a-piece',
 'a-year',
 'after-days',
 'bank-notes',
 'bed-chamber',
 'bed-rooms',
 'before-hand',
 'bowling-green',
 'breakfast-room',
 'broken-hearted',
 'brother-in-law',
 'by-and-by',
 'card-table',
 'card-tables',
 'carpet-work',
 'chimney-board',
 'cold-hearted',
 'common-place',
 'dairy-maid',
 'daughter-in-law',
 'death-like',
 'dining-room',
 'dove-cote',
 'drawing-room',
 'drawing-table',
 'dressing-closet',
 'dressing-room',
 'ear-rings',
 'farm-house',
 'flower-garden',
 'free-spoken',
 'fruit-trees',
 'gentleman-
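The `(?:...)` in the pattern above is a non-capturing group, and it matters here because `re.findall()` returns group contents instead of the whole match whenever the pattern contains a capturing group. A small sketch on a made-up sentence:

```python
import re

text = 'Her brother-in-law walked to the drawing-room.'

# Non-capturing group: findall() returns each whole match
whole = re.findall(r'\w+(?:-\w+)+', text)

# Capturing group: findall() returns only the group's last repetition
groups = re.findall(r'\w+(-\w+)+', text)

print(whole)   # ['brother-in-law', 'drawing-room']
print(groups)  # ['-law', '-room']
```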

## Substitution

Now detect -y words inflected as -ies (e.g., *fly* -> *flies*). Save them to a set called `ies`.

In [11]:
ies = set(re.findall(r'(?:\w+-)*\w+ies\b', sense))
ies

{'Davies',
 'Indies',
 'abilities',
 'agonies',
 'annuities',
 'apologies',
 'assemblies',
 'assiduities',
 'beauties',
 'carries',
 'certainties',
 'cherries',
 'civilities',
 'cries',
 'deficiencies',
 'delicacies',
 'dies',
 'difficulties',
 'duties',
 'enquiries',
 'entreaties',
 'excellencies',
 'families',
 'flatteries',
 'implies',
 'injuries',
 'inquiries',
 'jealousies',
 'ladies',
 'legacies',
 'lies',
 'marries',
 'opportunities',
 'parties',
 'probabilities',
 'promontories',
 'propensities',
 'prophecies',
 'puppies',
 'remedies',
 'scrutinies',
 'series',
 'shrubberies',
 'species',
 'studies',
 'tries'}

Use `re.sub()` to try to deinflect them to their dictionary form, and check whether the result is in the dictionary.

In [12]:
WORDS = set(nltk.corpus.words.words())  # a set makes each `in` check fast

In [13]:
for word in ies:
    y = re.sub(r'ies$', r'y', word)
    if y.lower() in WORDS:
        print('YES', word)
    else:
        print('NO ', word)

YES parties
YES apologies
YES flatteries
YES families
YES difficulties
YES ladies
YES civilities
YES prophecies
YES lies
YES assiduities
YES delicacies
YES Davies
YES enquiries
YES excellencies
YES propensities
YES shrubberies
NO  dies
YES inquiries
YES beauties
YES studies
YES puppies
YES Indies
NO  species
YES injuries
YES jealousies
YES deficiencies
YES certainties
YES tries
YES scrutinies
YES assemblies
YES agonies
YES opportunities
YES annuities
YES marries
YES remedies
YES implies
YES cherries
YES probabilities
YES cries
YES entreaties
YES duties
YES legacies
NO  series
YES abilities
YES promontories
YES carries


# Stemming and Lemmatization

The `re.sub()` call above replaced `ies` with `y`, which is a crude form of lemmatization. Stemming would remove the `ies` without inserting anything. Stemming is more robust to novel or misspelled words, but lemmatization gives cleaner results (when it works). In NLTK, you may notice that the WordNet module can do some lemmatization. For instance, it can find synsets for *catch* when queried with *caught*:

In [14]:
from nltk.corpus import wordnet as wn
wn.synsets('caught')

[Synset('catch.v.01'),
 Synset('catch.v.02'),
 Synset('get.v.19'),
 Synset('catch.v.04'),
 Synset('get.v.11'),
 Synset('hitch.v.01'),
 Synset('catch.v.07'),
 Synset('capture.v.06'),
 Synset('catch.v.09'),
 Synset('catch.v.10'),
 Synset('overtake.v.01'),
 Synset('catch.v.12'),
 Synset('catch.v.13'),
 Synset('catch.v.14'),
 Synset('watch.v.03'),
 Synset('catch.v.16'),
 Synset('trip_up.v.01'),
 Synset('catch.v.18'),
 Synset('catch.v.19'),
 Synset('catch.v.20'),
 Synset('catch.v.21'),
 Synset('catch.v.22'),
 Synset('capture.v.02'),
 Synset('catch.v.24'),
 Synset('catch.v.25'),
 Synset('catch.v.26'),
 Synset('catch.v.27'),
 Synset('catch.v.28'),
 Synset('catch.v.29')]

This works by applying some basic morphological processing (like our 'ies' -> 'y' substitution) and checking whether the result exists in WordNet. WordNet also contains some irregular forms, which is how it finds 'catch' for 'caught'. You can use WordNet's lemmatizer directly, without looking up synsets yourself (but note it only works for English):

In [15]:
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()
wnl.lemmatize('caught', pos='v')

'catch'

But note that the default part-of-speech is 'n' (noun):

In [16]:
help(wnl.lemmatize)

Help on method lemmatize in module nltk.stem.wordnet:

lemmatize(word, pos='n') method of nltk.stem.wordnet.WordNetLemmatizer instance



So it won't work well on verbs if the part-of-speech is not specified, or in general when the part-of-speech is incorrect:

In [17]:
wnl.lemmatize('caught')  # pos='n' is the default

'caught'

In [18]:
wnl.lemmatize('oxen')  # default works well for nouns

'ox'

In [19]:
wnl.lemmatize('oxen', pos='v')  # specifying the wrong pos is also bad

'oxen'
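To see why the part-of-speech matters, here is a toy sketch of the strategy described above: look the word up in a table of irregular forms, then try suffix rules and keep a result only if it is in the lexicon. This is not NLTK's actual implementation; `LEXICON`, `EXCEPTIONS`, `SUFFIX_RULES`, and `toy_lemmatize` are hypothetical names with made-up data, and the only assumption carried over is that each of these resources is kept separately per part-of-speech.

```python
import re

# Hypothetical toy data, keyed by part of speech; real WordNet is far larger
LEXICON = {
    'n': {'ox', 'fly', 'catch'},
    'v': {'catch', 'fly', 'carry'},
}
EXCEPTIONS = {
    'n': {'oxen': 'ox'},
    'v': {'caught': 'catch', 'flew': 'fly'},
}
SUFFIX_RULES = {
    'n': [(r'ies$', 'y'), (r'es$', ''), (r's$', '')],
    'v': [(r'ies$', 'y'), (r'es$', ''), (r'ed$', ''), (r's$', '')],
}

def toy_lemmatize(word, pos='n'):
    """Toy lemmatizer: exceptions first, then suffix rules, else unchanged."""
    # 1. Irregular forms come from a lookup table for this part of speech
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. Try each suffix rule and keep the first result found in the lexicon
    for pattern, replacement in SUFFIX_RULES[pos]:
        candidate = re.sub(pattern, replacement, word)
        if candidate != word and candidate in LEXICON[pos]:
            return candidate
    # 3. Nothing applied, so return the input unchanged
    return word

print(toy_lemmatize('caught', pos='v'))  # catch
print(toy_lemmatize('caught'))           # caught
print(toy_lemmatize('oxen'))             # ox
print(toy_lemmatize('oxen', pos='v'))    # oxen
```

In this sketch, 'caught' is only listed among the verb exceptions and 'oxen' only among the noun exceptions, so a lookup under the wrong part-of-speech falls through every step and returns the word unchanged, mirroring the behaviour seen above.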