Learning Objectives
- Concepts: higher-order functions recursive functions map() filter() n-grams collocations part-of-speech tags
(color key: Python/Programming NLP/CL Software Engineering)
Reading
Higher-order and Recursive Functions
First-class Functions
By now you know how to define a function. For instance, to square a number, you might write:
Note that this definition does two things: (1) it defines the code that, when called, accepts the x
argument then computes and returns the square of x
; and (2) it assigns this “function object” to the name square
. When we invoke the name square
with parentheses and the appropriate arguments, we get a result:
And if you just type the function name without calling it in an interactive interpreter, Python will tell you it’s a function:
That is, the value of square
is the function object. This object is a value just like more traditional values in Python, such as integers and strings, and we can reassign the function to another variable if we like:
Python is said to have first-class functions, meaning that functions are not merely something to be called, but something that can be assigned, passed as arguments to other functions, and used as the return value of another function.
Higher-order Functions
Functions are said to be higher-order functions if they can accept other functions as arguments or if they return a function as their result.
Continuing our example from before, we may wish to generalize square()
to a function that, given an exponent n
, returns a function that raises x
to the power of n
:
Then we could create square
simply as follows:
You don’t even need to assign the created function to use it:
In Python, map() and filter() are built-in functions that take a function as their first argument and an iterable as their second argument. The map()
function applies the function argument to each item in the iterable and yields the result:
>>> list(map(square, range(10))) # apply square() to each value in range(10)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> [square(x) for x in range(10)] # this does the same thing in a comprehension
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Note: list()
is necessary to get the full list; otherwise map()
returns an iterator that yields one value at a time.
The filter()
function yields values from the iterable if the function argument applied to the value evaluates as true.
We can see how this works by using map()
to see the True
/False
values, and noting that str.isnumeric(x)
is the same as x.isnumeric()
when x
is a string:
>>> '1'.isnumeric() # the normal way to call `isnumeric()`
True
>>> str.isnumeric('1') # the functional way to call it
True
>>> list(map(str.isnumeric, '1 two 3 four 5'.split()))
[True, False, True, False, True]
To see how higher-order functions might help with NLP, please read this short section from the NLTK book:
Recursive Functions
You may know of recursive linguistic structures, such as grammatical constructs which allow sentences like Kim knows that Sandy knows that Pat knows that the dog barked. Similarly, many programming languages allow recursive functions which call themselves. For example, you may want to write a function that computes the factorial of some integer x
:
def factorial(x):
# e.g., if x is 7, this would return 7 * factorial(6),
# which returns 6 * factorial(5), etc.
return x * factorial(x - 1)
Let’s try it out:
>>> factorial(7)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in factorial
File "<stdin>", line 2, in factorial
File "<stdin>", line 2, in factorial
[Previous line repeated 996 more times]
RecursionError: maximum recursion depth exceeded
Oops. This did 7 * (6 * (5 * ...))
but it did not stop at 1, so it continued: ... * (0 * (-1 * ...))
. Recursive functions must have a base case which decides when it should stop. For instance:
And try again:
For some problems, recursive functions are the natural way to compute a result, but they have limits (as seen by the RecursionError
above) and can lead to very inefficient code. Recursive algorithms are often more efficient as an iterative loop (for
, while
, etc.), but the iterative code can be less intuitive.
N-grams
Bigrams, Trigrams, and more generally n-grams are subsequences of some sequence with length 2, 3, or more. In computational linguistics we often use n-grams for sequences of tokens, words, characters, etc. The NLTK has the nltk.ngrams()
function which takes an iterable and a number n
and yields subsequence tuples of length n
from the iterable:
>>> import nltk
>>> four_words = 'many dogs chase cats'.split()
>>> list(nltk.ngrams(four_words, 1)) # unigrams
[('many',), ('dogs',), ('chase',), ('cats',)]
>>> list(nltk.ngrams(four_words, 2)) # bigrams
[('many', 'dogs'), ('dogs', 'chase'), ('chase', 'cats')]
>>> list(nltk.ngrams(four_words, 3)) # trigrams
[('many', 'dogs', 'chase'), ('dogs', 'chase', 'cats')]
>>> list(nltk.ngrams(four_words, 4)) # 4-grams
[('many', 'dogs', 'chase', 'cats')]
>>> list(nltk.ngrams(four_words, 5)) # 5-grams
[]
See also: Google’s Ngram Viewer
The items in the n-grams might be complex objects, such as n-grams of word-tag pairs:
>>> nltk.download('brown') # if you don't have it yet
>>> from nltk.corpus import brown
>>> tagged_sent = brown.tagged_sents()[0]
>>> tagged_sent[0:3] # inspect some items
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
>>> tagged_bigrams = list(nltk.ngrams(tagged_sent, 2))
>>> for bigram in tagged_bigrams[:2]: # inspect the first two bigrams
... print(bigram) # note that each bigram is a word-tag pair
(('The', 'AT'), ('Fulton', 'NP-TL'))
(('Fulton', 'NP-TL'), ('County', 'NN-TL'))
Collocations
N-grams of words that occur more often than you would expect in a random distribution are called collocations. Furthermore, these words may have some special, perhaps non-compositional meaning that wouldn’t be obvious by the individual words. By counting the occurrences of n-gram pairs and comparing them to the counts of the words individually, we can find those that tend to appear together. For example:
def collocations(words):
from operator import itemgetter
# Count the words and bigrams
wfd = nltk.FreqDist(words)
pfd = nltk.FreqDist(nltk.bigrams(words)) # bigrams is the same as nltk.ngrams(words, 2)
# Compute the scores for each pair
scored = [((w1, w2), score(w1, w2, wfd, pfd)) for w1, w2 in pfd]
## sort according to the score
scored.sort(key=itemgetter(1), reverse=True)
return [p for (p, s) in scored]
def score(word1, word2, wfd, pfd, power=3):
'''return the collocation score f(w1,w2)^power/(f(w1)*f(w2))'''
freq1 = wfd[word1]
freq2 = wfd[word2]
freq12 = pfd[(word1, word2)]
return freq12 ** power / float(freq1 * freq2)
And in use:
>>> from nltk.corpus import gutenberg
>>> words = [w for w in gutenberg.words('melville-moby_dick.txt') if len(w) > 2]
>>> print(collocations(words)[:10])
[('Moby', 'Dick'), ('Sperm', 'Whale'), ('White', 'Whale'), ('Dough', 'Boy'), ('Mrs', 'Hussey'), ('Sag', 'Harbor'), ('Father', 'Mapple'), ('New', 'Bedford'), ('Fast', 'Fish'), ('have', 'been')]
For further information (not required reading), see:
Parts-of-Speech
I assume you are familiar with word categories, namely “parts of speech”. If not, please read or skim the Wikipedia article. Then please read the following sections of the NLTK book about part-of-speech tagging:
Testing Your Knowledge
Questions
Q: If you have a list of length
m
, how many bigrams will it have? Trigrams? n-grams in general?Q: How might you deal with the problem of getting an n-gram from a sequence where the length of the sequence is less than
n
?Q: Do all languages use the same parts of speech? Does a single language have a clear and complete set of parts-of-speech?
Q: What is a tagset? What are examples of tagsets?
Practical Work
Write a recursive function
fib(x)
to compute the Fibonacci number ofx
. Callingfib(1)
should return 1,fib(5)
returns 8, and so on. What happens if you callfib(20)
,fib(30)
, orfib(40)
? (use Ctrl-C to kill a stuck process). Now try rewriting the recursive function as an iterative function. Can you go higher thanfib(40)
?Write a function
bigrams(xs)
that takes a listxs
and returns a list of bigrams fromxs
.Write a function that takes a string of
word/TAG
formatted pairs and returns the list of (word, tag) tuples (hint: usenltk.str2tuple()
)Write a function that uses
nltk.map_tag()
to map a list of (word, tag) pairs to the universal tagset.