Learning Objectives
- Concepts: regular expressions, re, stemming, lemmatization, segmentation
(color key: Python/Programming NLP/CL Software Engineering)
Reading
Please see the slides from my 2019 workshop on regular expressions for a quick introduction:
For applications of regular expressions to NLP, please read these sections from the NLTK book:
- NLTK 3.4 – Regular Expressions for Detecting Word Patterns
- NLTK 3.5 – Useful Applications of Regular Expressions
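As a quick taste of what those sections cover, here is a minimal sketch of filtering a word list with Python's re module. The word list is a small made-up sample for illustration; the NLTK book applies the same idea to a full corpus word list.

```python
import re

# A small hypothetical word list (an assumption for this sketch).
words = ["walked", "walking", "abjectly", "jumped", "red", "editor", "table"]

# '$' anchors the pattern to the end of the string: words ending in "ed".
ends_in_ed = [w for w in words if re.search(r"ed$", w)]

# '^' anchors to the start; [aeiou] matches any one of the listed characters.
starts_with_vowel = [w for w in words if re.search(r"^[aeiou]", w)]

# '.' matches any single character: six-letter words with "a" second and "e" fifth.
six_letter_a_e = [w for w in words if re.search(r"^.a..e.$", w)]

print(ends_in_ed)         # ['walked', 'jumped', 'red']
print(starts_with_vowel)  # ['abjectly', 'editor']
print(six_letter_a_e)     # ['walked']
```

Note that re.search looks for a match anywhere in the string, so the anchors ^ and $ are what tie a pattern to the beginning or end of a word.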
We have already discussed tokenization and basic normalization; the following sections cover stemming, lemmatization, and segmentation in more detail:
- NLTK 3.6 – Normalizing Text
- NLTK 3.7 – Regular Expressions for Tokenizing Text
- NLTK 3.8 – Segmentation
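To make the three terms concrete before reading, here is a toy sketch using only the re module. None of this is NLTK's implementation: the suffix list, the mini-lexicon TOY_LEMMAS, and the function names are all hypothetical, chosen only to show the general idea of each operation.

```python
import re

def toy_stem(word):
    """Strip one common suffix by rule; the result need not be a real word."""
    return re.sub(r"(ing|ies|ed|es|s)$", "", word)

# Hypothetical mini-lexicon mapping inflected forms to dictionary headwords.
TOY_LEMMAS = {"flies": "fly", "women": "woman", "went": "go"}

def toy_lemmatize(word):
    """Look the word up in a lexicon; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

def toy_sentences(text):
    """Split after ., ! or ? followed by whitespace.

    This naive rule breaks on abbreviations such as "Dr. Smith".
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(toy_stem("walked"), toy_stem("flies"), toy_stem("women"))  # walk fl women
print(toy_lemmatize("flies"), toy_lemmatize("women"))            # fly woman
print(toy_sentences("Dogs bark. Cats meow! Do fish talk?"))
```

The stemmer works purely on the surface string, so it produces "fl" for "flies" and cannot handle "women" at all, while the lemmatizer returns real headwords but only for forms its lexicon knows.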
Additional Reading
These links may be helpful, but are not assigned reading:
- Regular Expression HOWTO (Python documentation, by A.M. Kuchling)
- Python Regular Expressions (Google for Education)
- regex101 (Useful web-app for constructing and inspecting regular expressions)
Testing Your Knowledge
Questions
- Q: What are regex metacharacters?
- Q: How is stemming different from lemmatization?
- Q: What is a kind of segmentation that is not tokenization/word-segmentation?
Practical Work
- Write regular expressions to match the following classes of strings:
  - A single determiner (assume that a, an, and the are the only determiners)
  - An arithmetic expression using integers, addition, and multiplication, such as 2*3+8
  - Phone numbers (e.g., +65 8012 3456)
- Create a function plural() that takes an English word and returns its plural form. Test it on dog, apple, fly, boy, woman.
- Find all verb-particle constructions (things like give up, look out) in WordNet.
  - Try to expand them to different inflectional forms: give up, giving up, gave up, given up
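One possible way to check your answers is a small test harness like the sketch below. The helper names check and check_plural are hypothetical, and the determiner pattern is included only as a worked example of the first exercise; the remaining patterns and plural() itself are left to you.

```python
import re

def check(pattern, should_match, should_not_match):
    """Return the test strings on which `pattern` gives the wrong answer."""
    regex = re.compile(pattern)
    # fullmatch requires the pattern to cover the whole string, not just a part.
    wrong = [s for s in should_match if not regex.fullmatch(s)]
    wrong += [s for s in should_not_match if regex.fullmatch(s)]
    return wrong

# Worked example: a, an, and the are assumed to be the only determiners.
DETERMINER = r"(?:a|an|the)"
failures = check(DETERMINER, ["a", "an", "the"], ["", "at", "then", "a the"])
print(failures)  # an empty list means every test string behaved as expected

# Expected plurals for the test words listed in the exercise.
PLURAL_CASES = {"dog": "dogs", "apple": "apples", "fly": "flies",
                "boy": "boys", "woman": "women"}

def check_plural(plural):
    """Return {word: wrong_output} for each test word that `plural` gets wrong."""
    return {w: plural(w) for w, expected in PLURAL_CASES.items()
            if plural(w) != expected}

# A deliberately naive baseline, to show what a failing report looks like.
print(check_plural(lambda w: w + "s"))  # {'fly': 'flys', 'woman': 'womans'}
```

Using re.fullmatch rather than re.search matters here: with re.search, the determiner pattern would also accept "then", because "the" occurs inside it.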