# Week 12

Overview

* [**Python Fundamentals**](#Python-Fundamentals)
  * [Data Types and Structures](#Data-Types-and-Structures)
  * [Assignment and Mutability](#Assignment-and-Mutability)
  * [Control Structures](#Control-Structures)
  * [Functions](#Functions)
  * [Built-in Functions](#Built-in-Functions)
  * [Regular Expressions](#Regular-Expressions)
  * [File I/O, Buffers, and Encodings](#File-I/O,-Buffers,-and-Encodings)
  * [Basics of Software Engineering](#Basics-of-Software-Engineering)

* [**NLP and the NLTK**](#NLP-and-the-NLTK)
  * [Working with Corpora](#Working-with-Corpora)
  * [Frequency Distributions](#Frequency-Distributions)
  * [Tokenization](#Tokenization)
  * [Stemming and Lemmatization](#Stemming-and-Lemmatization)
  * [Wordnets](#Wordnets)
  * [N-grams and Sequence Models](#N-grams-and-Sequence-Models)
  * [Part-of-speech Tagging](#Part-of-speech-Tagging)
  * [Supervised Classification](#Supervised-Classification)

## Python Fundamentals

These are the basic features of Python that you should know how to use.

### Data Types and Structures

Numbers, strings, and booleans are primary data types in Python, while lists, sets, tuples, and dicts are data structures (they are containers of other data types or structures).

#### Numbers (int, float)

Integers are "whole" numbers while floats have a fractional component.

In [98]:
42 # make the integer 42

42

In [99]:
3.14159 # make the float 3.14159

3.14159

In [100]:
2 + 3 # add two numbers

5

In [101]:
2 / 3 # divide two integers (is the result an integer?)

0.6666666666666666

In [102]:
2**3 # raise 2 to the power of 3

8

#### Strings

Strings are immutable sequences of characters.

In [103]:
s = 'Kalau rasa gembira tepuk tangan' # make a string for the following: Kalau rasa gembira tepuk tangan

In [104]:
s.upper() # upcase all the letters (use a string method)

'KALAU RASA GEMBIRA TEPUK TANGAN'

In [105]:
s.split() # split the words into a list 

['Kalau', 'rasa', 'gembira', 'tepuk', 'tangan']

In [106]:
'Kim said, "we\'re going for tea"' # make a string for the following: Kim said, "we're going for tea"
'''Kim said, "we're going for tea"'''

'Kim said, "we\'re going for tea"'

In [107]:
'Hello, {}'.format('Sandy') # format a (new) string to say "Hello, (name)" where (name) is a value passed to the str.format() function

'Hello, Sandy'

#### Booleans and None

Booleans are `True` and `False` values used in control structures and are the result of comparisons. `None` is a special value meaning "no value" (e.g., when a function returns nothing). Boolean values can be used in boolean logic (`and`, `or`, `not`).

In [108]:
True and (False or True)  # what is the result of: True and (False or True)

True

In [109]:
1 < 2 and ('abc'.startswith('z') or bool(-1))  # what is the result of: 1 < 2 and ('abc'.startswith('z') or bool(-1))

True

#### Lists

Lists are mutable sequences of data.

In [110]:
x = [1, 2, 3, 4, 5] # make a list containing the values 1, 2, 3, 4, 5

In [111]:
x = [i**2 for i in x] # use a list comprehension to square those values

In [112]:
15 in x # determine if 15 is in the list

False

In [113]:
x[-1] # find the value of the last item

25

In [114]:
x.index(4) # find the position of the value 4

1

In [115]:
x[2] = -9  # replace the item at position 2 with the value -9
x

[1, 4, -9, 16, 25]

#### Sets

Sets are mutable collections of unique items.

In [116]:
s = set('Cappuccino or latte?') # create a set for the characters in the string: Cappuccino or latte?

In [117]:
len(s) # how many items does it contain

14

In [118]:
s2 = set('Do you have milk tea?') # take the intersection of the set with the set containing the characters in: Do you have milk tea?
s.intersection(s2)

{' ', '?', 'a', 'e', 'i', 'l', 'o', 't', 'u'}

#### Tuples

Tuples are immutable sequences of data. 

In [119]:
("Kim", 1, "tea", True) # create a tuple containing the values "Kim", 1, "tea", True

('Kim', 1, 'tea', True)

In [120]:
# unpack the tuple into the variables: name, quantity, beverage, has_milk
name, quantity, beverage, has_milk = ("Kim", 1, "tea", True)
beverage

'tea'

#### Dicts

Dictionaries (dicts) are mutable mappings of keys to values.

In [121]:
d = {'Kim': 'milk tea'}  # create a dict mapping the key "Kim" to "milk tea"
d

{'Kim': 'milk tea'}

In [122]:
d["Sandy"] = "long black" # add a new entry mapping "Sandy" to "long black"

In [123]:
d["Kim"] # retrieve the value of the key "Kim"

'milk tea'

### Assignment and Mutability

Values may be assigned to variables. Primary types are always immutable, while data structures may be mutable or immutable.

* What does mutability mean?
* What is the value of `y` after the following is executed:

  ```python
  >>> x = 5
  >>> y = x
  >>> x += 3
  ```
  
* What is the value of `z` after the following is executed:

  ```python
  >>> w = [1, 2]
  >>> z = w
  >>> w.append(3)
  >>> w = [4, 5, 6]
  ```
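
One way to check your answers to the two questions above is to run the snippets and inspect the results. The sketch below does that; the `is` check is an addition that tests whether two names refer to the same object.

```python
# Integers are immutable: `x += 3` rebinds x to a new int and leaves y alone.
x = 5
y = x
x += 3
print(x, y)  # 8 5

# Lists are mutable: w and z name the same list, so append() is visible via z.
w = [1, 2]
z = w
w.append(3)
print(z is w)  # True

# Rebinding w to a brand-new list does not affect z.
w = [4, 5, 6]
print(w, z)  # [4, 5, 6] [1, 2, 3]
```

Mutability, then, is about whether an object can be changed in place; assignment only ever makes a name point at an object.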

### Control Structures

Programs usually execute each step sequentially. Control structures alter the flow of execution.

#### if statements

`if` statements execute a nested block of code only if the condition passes; if it doesn't pass, any following `elif` statements are then attempted with other conditions. Finally, if no condition passes and there's an `else` statement, its block is executed.

In [124]:
x = int(2/3)
if x > 0:  # print 'greater than' if the value of x is greater than 0
    print('greater than')
elif x < 0: # print 'less than' if x is less than 0
    print('less than')
else: # otherwise print 'equal to'
    print('equal to')

equal to


#### for loops

`for` loops iterate over some sequence of data and execute once for each item.

In [125]:
x = ['one', 'two', 'three']
for i in x:
    # print each string in x surrounded by '---' (e.g., '---one---', then '---two---', etc.)
    print('---{}---'.format(i))

---one---
---two---
---three---


#### while loops

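`while` loops execute their code repeatedly for as long as their condition remains true; they stop the first time the condition is checked and found to be false.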

In [126]:
x = [5, 4, 3, 2, 1]
# while x has more than one number, take the last two elements (hint: use list.pop())
# add the two numbers
# print the result
# and append the result back onto x
while len(x) > 1:
    a = x.pop()
    b = x.pop()
    c = a + b
    print(c)
    x.append(c)

3
6
10
15


### Functions

Functions let you write some code that can be reused by calling it later, possibly with different parameters.

In [127]:
# write a function double() that takes a value x and returns x multiplied by 2
def double(x):
    return x * 2

In [128]:
assert double(4) == 8
assert double('ha') == 'haha'

### Built-in Functions

You should know what [Python's built-in functions](https://docs.python.org/3/library/functions.html) do, and know how to use them (perhaps using some documentation).

#### `int()`, `float()`, `str()`, `bool()`, `list()`, `tuple()`, `dict()`, and `set()`

These functions cast or create values for Python's basic data types.

In [129]:
int('10')

10

In [130]:
float(3)

3.0

In [131]:
str(3)

'3'

In [132]:
bool(3)

True

In [133]:
list('abc')

['a', 'b', 'c']

In [134]:
tuple('abc')

('a', 'b', 'c')

In [135]:
dict([('a', 1), ('b', 2)])

{'a': 1, 'b': 2}

In [136]:
set('aaabbc')

{'a', 'b', 'c'}

#### `all()`, `any()`, `min()`, `max()`, `sum()`, `len()`

These functions "reduce" iterables: that is, they take an iterable like a list or set, and return a single value.

In [137]:
all([True, True, False])

False

In [138]:
any([False, False, True])

True

In [139]:
min([8, 6, 7, 5, 3, 0, 9])

0

In [140]:
max([8, 6, 7, 5, 3, 0, 9])

9

In [141]:
sum([8, 6, 7, 5, 3, 0, 9])

38

In [142]:
len('abcdefghijklmnopqrstuvwxyz')

26

#### `map()`, `sorted()`, `reversed()`, `range()`, and `zip()`

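These functions produce iterables, usually from other iterables (`range()` instead takes integer arguments). `map()` and `sorted()` are **higher-order functions** because they can also take another function as a parameter to determine their output: `map()` takes it as its first argument and `sorted()` takes it as its `key` argument.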

In [143]:
list(map(str.lower, ['Pierre', 'Vinken']))

['pierre', 'vinken']

In [144]:
sorted('Chef Charlie chopped chilli peppers while chef Chloe charred chicken.'.split())

['Charlie',
 'Chef',
 'Chloe',
 'charred',
 'chef',
 'chicken.',
 'chilli',
 'chopped',
 'peppers',
 'while']

In [145]:
sorted('Chef Charlie chopped chilli peppers while chef Chloe charred chicken.'.split(), key=str.lower)

['Charlie',
 'charred',
 'Chef',
 'chef',
 'chicken.',
 'chilli',
 'Chloe',
 'chopped',
 'peppers',
 'while']

In [146]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [147]:
list(zip('abc', range(3)))

[('a', 0), ('b', 1), ('c', 2)]
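
As a small illustration of passing one function to another, the sketch below uses the built-in `len` as the function argument. The word list is made up for the example.

```python
words = ['Pierre', 'Vinken', 'will', 'join', 'the', 'board']

# key= tells sorted() what to compare: here, the length of each word.
# Ties keep their original order because sorted() is stable.
by_length = sorted(words, key=len)
print(by_length)  # ['the', 'will', 'join', 'board', 'Pierre', 'Vinken']

# map() applies the function to every item.
lengths = list(map(len, words))
print(lengths)  # [6, 6, 4, 4, 3, 5]

# max() and min() accept key= as well; the first longest word wins.
longest = max(words, key=len)
print(longest)  # Pierre
```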

#### `open()` and `print()`

These functions work with files and buffers for reading and writing text or bytes.

In [148]:
import tempfile
import os
with tempfile.TemporaryDirectory() as tmpdirname:  # don't worry about tempfile; you don't need to know it
    path = os.path.join(tmpdirname, 'myfile.txt')

    # write to a file
    with open(path, 'w') as outfile:
        print('some data...', file=outfile)
        
    # read it back
    with open(path, 'r') as infile:
        print('read from file:', infile.read())

read from file: some data...



### Regular Expressions

Regular expressions are used for matching patterns in text.

In [149]:
import re
re.match(r'one', 'three two one')  # no match: match() must match the start of the string
re.search(r'one', 'three two one')  # search will find the first match

<re.Match object; span=(10, 13), match='one'>
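
Beyond `match()` and `search()`, a few other parts of the `re` module come up often. The sketch below reuses the same example string; the patterns themselves are only illustrations.

```python
import re

text = 'three two one'

# match() only succeeds at the start of the string.
print(re.match(r'one', text))            # None
print(re.match(r'three', text).group())  # three

# findall() returns every non-overlapping match as a list of strings.
t_words = re.findall(r'\bt\w*', text)
print(t_words)  # ['three', 'two']

# Parentheses capture parts of a match; $ anchors to the end of the string.
m = re.search(r'(\w+) (\w+)$', text)
print(m.groups())  # ('two', 'one')

# sub() replaces each match with a replacement string.
dashed = re.sub(r'\s+', '-', text)
print(dashed)  # three-two-one
```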

### File I/O, Buffers, and Encodings

Files and responses to web requests are **buffers** that hold a portion of some data at a time. They generally contain bytes which must be decoded using the appropriate **encoding** to get a unicode string.

In [150]:
from urllib import request
response = request.urlopen('http://andrew.triumf.ca/multilingual/samples/korean.html')
# response is a buffer, use its read() method to store the data in a variable named raw
raw = response.read()

In [151]:
# decode the bytes into a unicode string using the 'ksc5601' encoding
s = raw.decode('ksc5601')
print(s)

$)C
<HEAD>
<TITLE>Korean / 한글 (KSC 5601)</TITLE>
</HEAD>
<BODY>
<H1>Korean / 한글 (KSC 5601)</H1>
안녕하세요, 안녕하십니까
</BODY>
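
The relationship between strings, bytes, and encodings can also be explored without a network connection. This is a minimal sketch using a short greeting from the page above; it shows that the same text becomes different bytes under different encodings, and that decoding with the wrong encoding fails.

```python
text = '안녕하세요'

# encode(): str -> bytes. The byte values depend on the encoding.
utf8_bytes = text.encode('utf-8')
ksc_bytes = text.encode('ksc5601')
print(len(text), len(utf8_bytes), len(ksc_bytes))  # 5 15 10

# decode(): bytes -> str. Use the same encoding that produced the bytes.
print(ksc_bytes.decode('ksc5601') == text)  # True

# Decoding with the wrong encoding raises an error here; in other cases
# it may silently produce the wrong characters instead.
try:
    ksc_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print('wrong encoding:', err.reason)
```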



### Basics of Software Engineering

Merely writing a program that does what you want is often not enough. Months later you may find that you are unable to read your own code, or the code may fail to work with some new version of its software dependencies. **Software Engineering** is a set of practices to ensure the reliability and maintainability of code. There are several ways to accomplish this.

#### Comments and Docstrings

Comments should describe the "why" of your code, while the code describes the "how".

Docstrings give documentation about what a function or module does, what assumptions it makes, corner cases, etc.
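
A minimal sketch of the difference, using a hypothetical `bigrams()` helper written for this example: the docstring states what the function does and what it assumes, while the comment explains why the code is written the way it is.

```python
def bigrams(tokens):
    """Return the list of adjacent pairs in *tokens*.

    Assumes *tokens* is a sequence that supports slicing. Sequences
    with fewer than two items yield an empty list.
    """
    # zip() stops at the shorter input, so the last token is never
    # paired with a missing successor.
    return list(zip(tokens, tokens[1:]))


# The docstring is stored on the function and is what help() displays.
print(bigrams.__doc__)
```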

#### Encapsulation: Functions and Modules

Functions and modules can be created to enable the reuse of code. Often the same piece of code is useful for many tasks, so putting it in a function, or several functions in a module, etc., makes it easier to reuse them.

#### Testing Your Code

Software tests are useful to ensure the correctness and good performance of code. Some kinds of tests include **unit tests** (testing single functions), **regression tests** (re-running existing tests after a change to check that behavior that used to work still does), and **performance tests** (testing the code's ability to perform under different workloads).
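
As one possible illustration of a unit test, the sketch below tests a `double()` function with Python's standard `unittest` module. The test class name and the empty-list case are choices made for this example.

```python
import unittest


def double(x):
    """Return x multiplied by 2."""
    return x * 2


class DoubleTest(unittest.TestCase):
    def test_number(self):
        self.assertEqual(double(4), 8)

    def test_string(self):
        self.assertEqual(double('ha'), 'haha')

    def test_empty_list(self):
        # A corner case: doubling an empty list gives an empty list.
        self.assertEqual(double([]), [])


# Run the tests from within a script or notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DoubleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Keeping tests like these and re-running them whenever the code or its dependencies change is what turns them into regression tests.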

## NLP and the NLTK

You are expected to know some basic tasks in the field of natural language processing and how to accomplish those tasks using the NLTK (with documentation, perhaps).

In [152]:
import nltk  # run this

### Working with Corpora

We've discussed two main kinds of corpora: **lexical resources** and **texts**.

* What are the differences between lexical resources and texts?
* Which of the following are lexical resources?
  - Wordnet
  - Newspaper archives
  - Spoken dialog
  - Bilingual dictionaries
  - Lists of stopwords

In [153]:
# import the nltk.corpus.stopwords corpus
from nltk.corpus import stopwords
# get the list of indonesian stopwords
stopwords.words('indonesian')

['ada',
 'adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut',
 ...]

In [154]:
# count how many sentences are in the Herman Melville story Moby Dick (from the nltk.corpus.gutenberg corpus)
from nltk.corpus import gutenberg
gutenberg.fileids()  # get the list of file ids
len(gutenberg.sents('melville-moby_dick.txt'))


10059

### Frequency Distributions

Frequency distributions are used to count things. Conditional frequency distributions count things separately for different contexts.

In [155]:
# create a frequency distribution of the downcased words in Moby Dick, excluding English stopwords
stop = set(stopwords.words('english'))
words = [word.lower()
         for word in gutenberg.words('melville-moby_dick.txt')]
words = [word for word in words if word not in stop]
nltk.FreqDist(words)

FreqDist({',': 18713, '.': 6862, ';': 4072, "'": 2684, '-': 2552, '"': 1478, '!': 1269, 'whale': 1226, '--': 1070, 'one': 921, ...})
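
To see what the two kinds of distribution store, here is a minimal sketch using only the standard library. `collections.Counter` stands in for `nltk.FreqDist`, and a dict of Counters stands in for `nltk.ConditionalFreqDist`; the sentence and the genre-labeled pairs are made-up toy data.

```python
from collections import Counter, defaultdict

tokens = 'the dog chased the cat and the cat slept'.split()

# A frequency distribution maps each item to its count.
fd = Counter(tokens)
print(fd.most_common(2))  # [('the', 3), ('cat', 2)]

# A conditional frequency distribution keeps one distribution per
# condition. Here the (hypothetical) condition is a genre label.
labeled = [('news', 'the'), ('news', 'market'),
           ('fiction', 'the'), ('fiction', 'whale'), ('fiction', 'whale')]
cfd = defaultdict(Counter)
for condition, word in labeled:
    cfd[condition][word] += 1

print(cfd['fiction']['whale'])  # 2
print(cfd['news']['whale'])     # 0
```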

### Tokenization

Tokenization breaks up a string into smaller strings. As this is often one of the first steps in **NLP pipelines** (sequential analysis tasks), the way it is done can have a large effect on downstream performance.

In [156]:
# tokenize the sentence: "Ms. Tan showed the flat to prospective renters from Taiwan, Japan, and Indonesia."
sent = "Ms. Tan showed the flat to prospective renters from Taiwan, Japan, and Indonesia."
import re
re.findall(r'\w+', sent)  # this method will remove punctuation


['Ms',
 'Tan',
 'showed',
 'the',
 'flat',
 'to',
 'prospective',
 'renters',
 'from',
 'Taiwan',
 'Japan',
 'and',
 'Indonesia']
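
If punctuation should be kept as separate tokens rather than dropped, one option is to extend the pattern. This is only a sketch of one regular-expression approach, not the tokenizer NLTK provides; note how it splits the abbreviation "Ms." into two tokens, which shows why tokenization choices matter downstream.

```python
import re

sent = "Ms. Tan showed the flat to prospective renters from Taiwan, Japan, and Indonesia."

# \w+ matches a run of word characters; [^\w\s] matches any single
# character that is neither a word character nor whitespace.
tokens = re.findall(r'\w+|[^\w\s]', sent)
print(tokens)
```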

### Stemming and Lemmatization

Stemming strips affixes with heuristic rules and may produce non-words (e.g., "hoov"), while lemmatization maps an inflected word to its dictionary form (e.g., "hoof"). Both are kinds of lexical normalization used to reduce data sparsity.

In [171]:
# use nltk.stem.WordNetLemmatizer and nltk.stem.porter.PorterStemmer to lemmatize and stem some inflected words
import nltk.stem
import nltk.stem.porter

wnl = nltk.stem.WordNetLemmatizer()
print('lemmatize "hooves":', wnl.lemmatize('hooves'))

ps = nltk.stem.porter.PorterStemmer()
print('stem "hooves":', ps.stem('hooves'))

lemmatize "hooves": hoof
stem "hooves": hoov
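
The difference in approach can be sketched with two toy functions. These are hypothetical illustrations only: the suffix rules are not the Porter algorithm and the lookup table is not WordNet, but they show that a stemmer needs only rules while a lemmatizer needs lexical knowledge.

```python
def toy_stem(word):
    """Strip the first matching suffix (made-up rules, not Porter's)."""
    for suffix in ('es', 's', 'ing', 'ed'):
        # Require a few remaining letters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# A tiny hand-made table of irregular forms (assumed for this example).
TOY_LEMMAS = {'hooves': 'hoof', 'geese': 'goose', 'ran': 'run'}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


for w in ['hooves', 'geese', 'walked']:
    print(w, toy_stem(w), toy_lemmatize(w))
```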


### Wordnets

Wordnets are highly structured lexical resources that describe the relationships between words and their senses.

In [173]:
from nltk.corpus import wordnet as wn
# find the common ancestor synset for "dog" and "cat"
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [174]:
dog = wn.synsets('dog')[0]

In [176]:
wn.synsets('cat')

[Synset('cat.n.01'),
 Synset('guy.n.01'),
 Synset('cat.n.03'),
 Synset('kat.n.01'),
 Synset('cat-o'-nine-tails.n.01'),
 Synset('caterpillar.n.02'),
 Synset('big_cat.n.01'),
 Synset('computerized_tomography.n.01'),
 Synset('cat.v.01'),
 Synset('vomit.v.01')]

In [182]:
cat = wn.synsets('cat')[0]

In [183]:
dog.common_hypernyms(cat)

[Synset('animal.n.01'),
 Synset('entity.n.01'),
 Synset('mammal.n.01'),
 Synset('placental.n.01'),
 Synset('object.n.01'),
 Synset('carnivore.n.01'),
 Synset('chordate.n.01'),
 Synset('physical_entity.n.01'),
 Synset('whole.n.02'),
 Synset('vertebrate.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01')]

In [181]:
dog.lowest_common_hypernyms(cat)

[Synset('carnivore.n.01')]
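
The idea behind a lowest common hypernym can be sketched on a toy taxonomy. Everything below is a simplification made up for illustration: each concept has exactly one parent, whereas synsets in a real wordnet may have several hypernyms.

```python
# Hypothetical toy taxonomy: each concept maps to its hypernym (parent).
HYPERNYM = {
    'dog': 'canine', 'canine': 'carnivore',
    'cat': 'feline', 'feline': 'carnivore',
    'carnivore': 'mammal', 'mammal': 'animal', 'animal': 'entity',
}


def hypernym_path(concept):
    """Return [concept, parent, grandparent, ..., root]."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Return the most specific concept that is above both a and b."""
    above_a = set(hypernym_path(a))
    # Walk upward from b; the first node also above a is the lowest one.
    for node in hypernym_path(b):
        if node in above_a:
            return node
    return None


print(lowest_common_hypernym('dog', 'cat'))  # carnivore
```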

### N-grams and Sequence Models

N-grams are small slices taken from sequential data. E.g., a "bigram" is a 2-tuple, a "trigram" is a 3-tuple, and a "unigram" is a 1-tuple. They are used to provide generalized models of the order of items in a sequence.

In [83]:
# Produce a list of the bigrams in the sentence: "Kim likes milk tea."
words = "Kim likes milk tea.".split()
list(zip(words, words[1:]))

[('Kim', 'likes'), ('likes', 'milk'), ('milk', 'tea.')]
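
The same `zip()` trick generalizes to any n. The `ngrams()` helper below is a sketch written for this example (NLTK has its own n-gram utilities); it zips together n copies of the token list, each shifted one position further.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of the sequence *tokens*."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped together;
    # zip() stops at the shortest slice, so no n-gram runs off the end.
    return list(zip(*(tokens[i:] for i in range(n))))


words = "Kim likes milk tea .".split()
print(ngrams(words, 1))
print(ngrams(words, 2))
print(ngrams(words, 3))
```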

In [184]:
# Create a conditional frequency distribution of bigrams for the sentences "the dog chased the cat" and "the dog slept".
s1 = "the dog chased the cat .".split()
s2 = "the dog slept .".split()

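# note: concatenating the sentences also counts the bigram ('.', 'the') that spans the sentence boundary
cfd = nltk.ConditionalFreqDist(nltk.bigrams(s1 + s2))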
print(cfd.conditions())

['the', 'dog', 'chased', 'cat', '.', 'slept']


In [185]:
# what is the most likely word to follow "the"?
cfd['the']

FreqDist({'dog': 2, 'cat': 1})
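
To turn those counts into a prediction, pick the most frequent follower and, if wanted, divide by the total to get a relative frequency. The sketch below rebuilds the counts with the standard library so it runs without NLTK; in NLTK itself, `cfd['the'].max()` should give the same word.

```python
from collections import Counter, defaultdict

sentences = [
    "the dog chased the cat .".split(),
    "the dog slept .".split(),
]

following = defaultdict(Counter)
for sent in sentences:
    # Pair words within one sentence at a time, so that no bigram
    # spans a sentence boundary.
    for first, second in zip(sent, sent[1:]):
        following[first][second] += 1

word, count = following['the'].most_common(1)[0]
total = sum(following['the'].values())
print(word, count, total)  # dog 2 3
print(count / total)       # relative frequency of "dog" after "the"
```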

### Part-of-speech Tagging

Words in a text can be annotated with their categories, called **parts of speech**, to provide a basic syntactic description.

In [90]:
# Use `nltk.pos_tag()` to tag the sentence "But Sandy enjoys coffee." (hint: you need to tokenize it first)
nltk.pos_tag("But Sandy enjoys coffee .".split())

[('But', 'CC'),
 ('Sandy', 'NNP'),
 ('enjoys', 'VBZ'),
 ('coffee', 'NN'),
 ('.', '.')]

### Supervised Classification

Supervised classification is one kind of **machine learning**. It is **supervised** because each data instance is annotated with a task-specific label, and the machine has to learn the associations between properties of the instance and the label.

In [91]:
from nltk.corpus import movie_reviews  # load the movie_reviews corpus
movie_reviews.categories()

['neg', 'pos']

In [96]:
# create a variable 'data' that contains the labeled data
# hint: for each category, get the fileids for the category,
#       then associate all words in each file to its category
data = []
for label in movie_reviews.categories():
    for fileid in movie_reviews.fileids(categories=label):
        words = movie_reviews.words(fileid)
        data.append((words, label))


In [98]:
# get the size of 1/5 of the data
k = 5
index = len(data) // k
index

400

In [103]:
# split the data into test and train portions
import random
random.shuffle(data)  # mix up the neg and pos items
test = data[:index]
train = data[index:]

In [101]:
# define a feature extractor function
def features(pair):
    sent, label = pair
    feats = {word.lower(): True for word in sent}
    return (feats, label)

In [104]:
# use the function to produce lists of (feature_dict, label) pairs for train and test
train_feats = list(map(features, train))
test_feats = list(map(features, test))

In [105]:
# train a naive bayes classifier with the training data
nb = nltk.NaiveBayesClassifier.train(train_feats)

In [108]:
# evaluate the accuracy on the test data
nltk.classify.accuracy(nb, test_feats)

0.68

In [110]:
# is that accuracy any good? We need to know the distribution of the test data

pos = len([label for _, label in test_feats if label=='pos'])
print(pos / len(test_feats))

0.48


68% accuracy is better than the 52% baseline we would get by labeling every test item as "neg". Note that these numbers may fluctuate depending on what `random.shuffle()` does above.
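
The baseline in that comparison can be computed rather than worked out by hand. This is a minimal sketch with a hypothetical `majority_baseline()` helper and a made-up label list that matches the proportions reported above (48% "pos" out of 400 test items).

```python
from collections import Counter


def majority_baseline(labels):
    """Return (label, accuracy) for always predicting the commonest label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Made-up test labels: 192 of 400 are 'pos' (48%), the rest 'neg'.
labels = ['pos'] * 192 + ['neg'] * 208
baseline_label, baseline_acc = majority_baseline(labels)
print(baseline_label, baseline_acc)  # neg 0.52
```

A classifier is only doing useful work if its accuracy is clearly above this kind of baseline.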