HG2051 – Week 8

Learning Objectives

Concepts: regular expressions, re, stemming, lemmatization, segmentation

(color key: Python/Programming, NLP/CL, Software Engineering)

Reading

Please see the slides from my 2019 workshop on regular expressions for a quick introduction:

Searching for Patterns with Regular Expressions

For applications of regular expressions to NLP, please read these sections from the NLTK book:

We have already discussed tokenization and basic normalization; now see the following for a fuller account of stemming, lemmatization, and segmentation.

Additional Reading

These links may be helpful, but are not assigned reading:

Regular Expression HOWTO (Python documentation, by A.M. Kuchling)
Python Regular Expressions (Google for Education)
regex101 (a useful web app for constructing and inspecting regular expressions)

Testing Your Knowledge

Questions

Q: What are regex metacharacters?
Q: How is stemming different from lemmatization?
Q: What is a kind of segmentation that is not tokenization/word-segmentation?
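
As a starting point for the first question, here is a minimal sketch of how one metacharacter behaves with and without escaping. The sample strings are made up for illustration; only the standard-library re module is used.

```python
import re

text = "Price: 3.50 (approx.)"

# Unescaped, "." is a metacharacter: it matches any character except a newline.
loose = re.findall(r"3.50", "3.50 3x50")

# Escaped, "\." matches only a literal period.
strict = re.findall(r"3\.50", "3.50 3x50")

print(loose)   # ['3.50', '3x50']
print(strict)  # ['3.50']

# re.escape() backslash-escapes the metacharacters in a literal string,
# so the result can be used as a pattern that matches the string itself.
literal = re.escape("(approx.)")
print(re.search(literal, text) is not None)  # True
```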

Practical Work

Write regular expressions to match the following classes of strings:
- A single determiner (assume that a, an, and the are the only determiners)
- An arithmetic expression using integers, addition, and multiplication, such as 2*3+8
- Phone numbers (e.g., +65 8012 3456)
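
One possible set of patterns for the three classes above is sketched here; try writing your own first. The constant names are hypothetical, and each pattern encodes assumptions that the task leaves open: case-insensitive determiners, arithmetic without spaces or parentheses, and phone numbers shaped exactly like the example.

```python
import re

# Assumption: determiners may be capitalized; \b keeps "a" from matching inside "apple".
DETERMINER = re.compile(r"\b(?:a|an|the)\b", re.IGNORECASE)

# Assumption: one or more integers joined by + or *, with no spaces or parentheses.
ARITHMETIC = re.compile(r"\d+(?:[+*]\d+)*")

# Assumption: "+", a 1-3 digit country code, then two space-separated groups of 4 digits.
PHONE = re.compile(r"\+\d{1,3} \d{4} \d{4}")

print(DETERMINER.findall("the cat ate an apple"))  # ['the', 'an']
print(bool(ARITHMETIC.fullmatch("2*3+8")))         # True
print(bool(ARITHMETIC.fullmatch("2*+8")))          # False
print(bool(PHONE.fullmatch("+65 8012 3456")))      # True
```

Using fullmatch() checks that the whole string belongs to the class, while findall() or search() locates matches inside longer text.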
Create a function plural() that takes an English word and returns its plural form. Test it on dog, apple, fly, boy, woman.
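
A rough sketch of one way to approach plural() follows. The irregular-noun table and the choice of suffix rules are assumptions made for illustration; they cover the five test words but are far from a complete account of English plurals.

```python
import re

# Assumption: a tiny hand-written table of irregular nouns, not exhaustive.
IRREGULAR = {"woman": "women", "man": "men", "child": "children"}

def plural(word):
    """Return a plural form of an English noun using a few simple rules."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    # consonant + y -> ies (fly -> flies); vowel + y falls through (boy -> boys)
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ies"
    # sibilant endings take -es (box -> boxes, church -> churches)
    if re.search(r"(?:s|x|z|ch|sh)$", word):
        return word + "es"
    return word + "s"

for noun in ["dog", "apple", "fly", "boy", "woman"]:
    print(noun, "->", plural(noun))
```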
Find all verb-particle constructions (phrasal verbs such as give up, look out) in WordNet.
Try to expand them into their inflected forms, e.g., give up, giving up, gave up, given up
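
The sketch below shows one way to begin, working on a plain list of lemma names so that it runs without any corpus installed. It assumes that multiword lemma names join their words with underscores (as in give_up), that a short hand-picked particle list is good enough, and that irregular verbs are listed by hand; the function names are hypothetical, and the regular rule is naive (it does not double final consonants, for example).

```python
# Assumption: a small, hand-picked particle list; not exhaustive.
PARTICLES = {"up", "out", "off", "in", "on", "down", "away", "back", "over"}

def phrasal_verbs(lemma_names):
    """Keep two-word lemmas like 'give_up' whose second word is a particle."""
    found = []
    for name in lemma_names:
        parts = name.split("_")
        if len(parts) == 2 and parts[1] in PARTICLES:
            found.append((parts[0], parts[1]))
    return found

# Assumption: irregular verbs listed by hand as
# (present participle, past tense, past participle).
IRREGULAR = {"give": ("giving", "gave", "given")}

def inflect(verb, particle):
    """Return the base form plus three inflected forms of a phrasal verb."""
    if verb in IRREGULAR:
        forms = IRREGULAR[verb]
    elif verb.endswith("e"):
        forms = (verb[:-1] + "ing", verb + "d", verb + "d")
    else:
        forms = (verb + "ing", verb + "ed", verb + "ed")
    return [f"{form} {particle}" for form in (verb,) + forms]

# With NLTK and its WordNet data installed, the input could instead come from:
#   from nltk.corpus import wordnet as wn
#   names = wn.all_lemma_names(pos="v")
names = ["give_up", "run", "look_out", "take_care_of"]
for verb, particle in phrasal_verbs(names):
    print(inflect(verb, particle))
```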