HG2051 – Week 3

Learning Objectives

Data types: str list set
Concepts: comparisons conditionals loops comprehensions functions filtering stopwords efficiency
Tools: NLTK

(color key: Python/Programming NLP/CL Software Engineering)

Reading

Control Flow

The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:

Functions

This section on defining functions extends what we talked about in class in Week 2):

4.6 – Defining Functions

Strings Methods

For now we will cover a subset of the available string methods:

List Methods and Other Uses

Lists also have a number of useful methods and :

The `in` Operator

Many kinds of “containers” in Python (which include strings, lists, sets, and other structures) work with Python’s in operator. For most containers, an x in y operation returns True if x is one of the elements contained in y. For strings, it returns True if x is a subsequence of the elements of y:

>>> my_list = [1, 2, 3]
>>> 2 in my_list           # check for an individual element
True
>>> [1, 2] in my_list      # this does not work
False
>>> [1, 2] in [[1, 2], 3]  # unless the list actually has [1, 2] as an element
True
>>> my_str = '123'
>>> '2' in my_str          # a single character is just a string with one character
True
>>> '12' in my_str         # `in` with strings checks for substrings
True
>>> '12' in '1 2 3'        # subsequences must be exact (spaces count)
False

Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)

Stopwords

Finally, also read this section of the NLTK book, but just the part about “stopwords” (just a few sentences and code blocks):

NLTK 4.1 – Wordlist Corpora

Testing Your Knowledge

Ensure you have the NLTK’s ‘gutenberg’ and ‘stopwords’ corpora downloaded by importing nltk and running the following two commands in Python (after >>>):

>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
True
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

You can find the available corpora like this:

>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Then get the NLTK’s “raw” (string) version of one of these as follows (here I get “Moby Dick”):

>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

Now moby_dick is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:

How many lines are in the file?
Is each line exactly one complete sentence?
How many tokens are in the file?
What is the average number of tokens per line?

Also use list or set comprensions to filter tokens to answer the following questions:

How many unique, case-normalized tokens are in the book?
What proportion of the case-normalized tokens are, or are not, stopwords?
What is the set of tokens in the book that begin with “whale”?
What is the set of tokens in the book that begin with “whale” and are all alphabetic characters?