Week 3

Learning Objectives

(color key: Python/Programming, NLP/CL, Software Engineering)

Reading

Control Flow

The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:
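As a quick reminder of what those mechanisms look like, here is a minimal sketch combining `for`, `if`/`elif`/`else`, and `while`. The token list and variable names are made up for illustration and do not come from the tutorial.

```python
# Hypothetical token list, used only for illustration.
tokens = ['Call', 'me', 'Ishmael', '.']

word_count = 0
for token in tokens:             # visit each item in the list in order
    if token.isalpha():          # alphabetic tokens are counted as words
        word_count += 1
    elif token == '.':           # a period is reported instead of counted
        print('found a period')
    else:                        # anything else is skipped
        print('skipping', token)

countdown = 3
while countdown > 0:             # repeat until the condition becomes False
    countdown -= 1

print(word_count)                # prints 3
```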

Functions

This section on defining functions extends what we talked about in class in Week 2:
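For reference, a function definition with a docstring, a default argument, and a return value might look like the sketch below. The function name `count_long_words` and its parameters are hypothetical, chosen only for this example.

```python
def count_long_words(words, min_length=5):
    """Return how many strings in `words` have at least `min_length` characters."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count


sample = ['whale', 'sea', 'harpoon']              # hypothetical data
total = count_long_words(sample)                  # uses the default min_length=5
strict = count_long_words(sample, min_length=6)   # overrides it by keyword
print(total, strict)                              # prints 2 1
```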

String Methods

For now we will cover a subset of the available string methods:
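To give a feel for how string methods behave, the sketch below applies a handful of common ones to a short sample string. Which methods are included here is my own selection, not necessarily the subset covered in the reading. Note that every method returns a new value; the original string is never modified.

```python
sentence = '  Call me Ishmael.  '                # hypothetical sample text

stripped = sentence.strip()                      # 'Call me Ishmael.'
lowered = stripped.lower()                       # 'call me ishmael.'
words = stripped.split()                         # ['Call', 'me', 'Ishmael.']
starts = stripped.startswith('Call')             # True
position = stripped.find('me')                   # 5 (index of first match)
occurrences = lowered.count('l')                 # 3
replaced = stripped.replace('Ishmael', 'Ahab')   # 'Call me Ahab.'
joined = '-'.join(words)                         # 'Call-me-Ishmael.'
```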

List Methods and Other Uses

Lists also have a number of useful methods and other uses:
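As a rough illustration, the sketch below runs a few list methods and operations on a small made-up list. Unlike string methods, most list methods change the list in place.

```python
words = ['whale', 'sea', 'ship']     # hypothetical sample data

words.append('harpoon')              # add one item at the end
words.extend(['sea', 'captain'])     # add every item from another list
sea_count = words.count('sea')       # 2
first_sea = words.index('sea')       # 1 (position of the first match)
last = words.pop()                   # removes and returns 'captain'
words.sort()                         # sorts in place and returns None
first_two = words[:2]                # slicing makes a new list
length = len(words)                  # 5
```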

The in Operator

Many kinds of “containers” in Python (which include strings, lists, sets, and other structures) work with Python’s in operator. For most containers, an x in y operation returns True if x is one of the elements contained in y. For strings, it returns True if x is a contiguous subsequence of the characters of y (that is, a substring):

>>> my_list = [1, 2, 3]
>>> 2 in my_list           # check for an individual element
True
>>> [1, 2] in my_list      # this does not work
False
>>> [1, 2] in [[1, 2], 3]  # unless the list actually has [1, 2] as an element
True
>>> my_str = '123'
>>> '2' in my_str          # a single character is just a string with one character
True
>>> '12' in my_str         # `in` with strings checks for substrings
True
>>> '12' in '1 2 3'        # subsequences must be exact (spaces count)
False
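The same operator works with sets and dictionaries. The sketch below, written as a script rather than an interactive session and using made-up data, shows that a dictionary checks its keys rather than its values, and that `not in` gives the opposite answer.

```python
vowels = {'a', 'e', 'i', 'o', 'u'}    # a set (hypothetical data)
counts = {'whale': 3, 'sea': 1}       # a dict (hypothetical data)

in_set = 'e' in vowels                # True
in_keys = 'whale' in counts           # True: a dict checks its keys
in_values = 3 in counts               # False: values are not searched
value_found = 3 in counts.values()    # True: search the values explicitly
absent = 'y' not in vowels            # True: `not in` negates the test
```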

Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)

Stopwords

Finally, also read this section of the NLTK book, but just the part about “stopwords” (just a few sentences and code blocks):
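The idea behind stopword filtering can be sketched without NLTK at all. In the example below the small `stopwords` set is a hand-made stand-in for the much longer list that `nltk.corpus.stopwords.words('english')` returns, and the tokens are invented for illustration.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english'),
# which returns a much longer list of lowercase words.
stopwords = {'the', 'of', 'a', 'and', 'in', 'is'}

tokens = ['The', 'whale', 'is', 'the', 'king', 'of', 'the', 'sea']

content_words = []
for token in tokens:
    # The stopwords are lowercase, so lowercase each token before testing.
    if token.lower() not in stopwords:
        content_words.append(token)

print(content_words)    # prints ['whale', 'king', 'sea']
```

Storing the stopwords in a set rather than a list is a design choice: membership tests with `in` are faster on sets, which matters when filtering a whole book.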

Testing Your Knowledge

Make sure you have NLTK’s ‘gutenberg’ and ‘stopwords’ corpora downloaded by importing nltk and running the following two download commands in Python (after the >>> prompt):

>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
True
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

You can list the texts available in the Gutenberg corpus like this:

>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Then get NLTK’s “raw” (string) version of one of these texts as follows (here I get “Moby Dick”):

>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

Now moby_dick is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:

Also use list or set comprehensions to filter tokens and answer the following questions:
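As a reminder of the syntax, here is a minimal sketch of a list comprehension and a set comprehension applied to a short, made-up token list. It is not an answer to any particular question, only an illustration of the filtering pattern.

```python
# Hypothetical tokens; in practice these would come from a real text.
tokens = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ahab', '.']

# List comprehension: keeps order and duplicates.
words = [t.lower() for t in tokens if t.isalpha()]

# Set comprehension: same filter, but duplicates collapse.
vocabulary = {t.lower() for t in tokens if t.isalpha()}

# Sets are unordered, so sort before displaying the result.
long_words = sorted(w for w in vocabulary if len(w) > 3)

print(len(words), len(vocabulary))   # prints 6 4
print(long_words)                    # prints ['ahab', 'call', 'ishmael']
```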