# Week 12

Overview

* [**Python Fundamentals**](#Python-Fundamentals)
  * [Data Types and Structures](#Data-Types-and-Structures)
  * [Assignment and Mutability](#Assignment-and-Mutability)
  * [Control Structures](#Control-Structures)
  * [Functions](#Functions)
  * [Built-in Functions](#Built-in-Functions)
  * [Regular Expressions](#Regular-Expressions)
  * [File I/O, Buffers, and Encodings](#File-I/O,-Buffers,-and-Encodings)
  * [Basics of Software Engineering](#Basics-of-Software-Engineering)

* [**NLP and the NLTK**](#NLP-and-the-NLTK)
  * [Working with Corpora](#Working-with-Corpora)
  * [Frequency Distributions](#Frequency-Distributions)
  * [Tokenization](#Tokenization)
  * [Stemming and Lemmatization](#Stemming-and-Lemmatization)
  * [Wordnets](#Wordnets)
  * [N-grams and Sequence Models](#N-grams-and-Sequence-Models)
  * [Part-of-speech Tagging](#Part-of-speech-Tagging)
  * [Supervised Classification](#Supervised-Classification)

## Python Fundamentals

These are the basic features of Python that you should know how to use.

### Data Types and Structures

Numbers, strings, and booleans are primary data types in Python, while lists, sets, tuples, and dicts are data structures (they are containers of other data types or structures).

#### Numbers (int, float)

Integers are "whole" numbers while floats have a fractional component.
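
As a quick illustration of the int/float distinction, here is a minimal sketch using made-up values (the variable names are only for this example). It shows how the arithmetic operators decide which type comes back.

```python
# Illustrative values only; the names are invented for this sketch.
whole = 7           # an int
fraction = 0.5      # a float

true_div = whole / 2       # "/" always produces a float in Python 3
floor_div = whole // 2     # "//" rounds down; the result is an int when both operands are ints
power = whole ** 2         # "**" is exponentiation
mixed = whole + fraction   # mixing int and float gives a float

print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
print(power, mixed)
```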

In [None]:
# make the integer 42

In [None]:
# make the float 3.14159

In [None]:
# add two numbers

In [None]:
# divide two integers (is the result an integer?)

In [None]:
# raise 2 to the power of 3

#### Strings

Strings are immutable sequences of characters.
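
Immutability means that string methods never change a string in place; they return a new one. A small sketch with illustrative text (not one of the exercise strings):

```python
greeting = 'selamat pagi'                     # illustrative text only
evening = greeting.replace('pagi', 'malam')   # returns a new string
print(evening)     # selamat malam
print(greeting)    # selamat pagi (the original is unchanged)

try:
    greeting[0] = 'S'          # item assignment is not supported by str
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)       # True
```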

In [None]:
s = '' # make a string for the following: Kalau rasa gembira tepuk tangan

In [None]:
# upcase all the letters (use a string method)

In [None]:
# split the words into a list 

In [None]:
# make a string for the following: Kim said, "we're going for tea"

In [None]:
# format a (new) string to say "Hello, (name)" where (name) is a value passed to the str.format() function

#### Booleans and None

Booleans are `True` and `False` values used in control structures and are the result of comparisons. `None` is a special value meaning "no value" (e.g., when a function returns nothing). Boolean values can be used in boolean logic (`and`, `or`, `not`).
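
The sketch below, with a hypothetical helper `log()`, shows where these values come from in practice: comparisons, truth-testing of other values with `bool()`, and a function that returns nothing.

```python
def log(message):
    """Print a message; with no return statement, the call evaluates to None."""
    print(message)

result = log('no explicit return')
print(result is None)              # True

# Comparisons produce booleans
is_bigger = 3 > 2                  # True
is_same = 'a' == 'b'               # False

# Empty containers, 0, and None count as false; most other values count as true
falsy = [bool(''), bool([]), bool(0), bool(None)]
truthy = [bool('x'), bool([0])]

# "or" returns its first true operand (or the last operand if none is true)
fallback = '' or 'default'
print(is_bigger, is_same, falsy, truthy, fallback)
```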

In [None]:
# what is the result of: True and (False or True)

In [None]:
# what is the result of: 1 < 2 and ('abc'.startswith('z') or bool(-1))

#### Lists

Lists are mutable sequences of data.

In [None]:
# make a list containing the values 1, 2, 3, 4, 5

In [None]:
# use a list comprehension to square those values

In [None]:
# determine if 15 is in the list

In [None]:
# find the value of the last item

In [None]:
# find the position of the value 4

In [None]:
# replace the item at position 2 with the value -9

#### Sets

Sets are mutable collections of unique items.
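
A short sketch of what "unique" and "mutable" mean here, using an invented word rather than the exercise strings:

```python
letters = set('banana')         # duplicates collapse to {'a', 'b', 'n'}
vowels = set('aeiou')

consonants = letters - vowels   # difference: items in letters but not in vowels
combined = letters | vowels     # union: items in either set

letters.add('s')                # sets can be changed in place
letters.add('a')                # adding an existing item has no effect
print(sorted(letters))          # ['a', 'b', 'n', 's']
print(sorted(consonants))       # ['b', 'n']
```

Sets are unordered, so the example sorts them before printing to get a stable display.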

In [None]:
# create a set for the characters in the string: Cappuccino or latte?

In [None]:
# how many items does it contain

In [None]:
# take the intersection of the set with the set containing the characters in: Do you have milk tea?

#### Tuples

Tuples are immutable sequences of data. 

In [None]:
# create a tuple containing the values "Kim", 1, "tea", True

In [None]:
# unpack the tuple into the variables: name, quantity, beverage, has_milk


#### Dicts

Dictionaries (dicts) are mutable mappings of keys to values.
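
The following sketch uses a hypothetical inventory dict to show updating a value in place, looking up a key that may be missing, and iterating over the mapping.

```python
stock = {'tea': 12, 'coffee': 0}    # hypothetical inventory: item -> count

stock['tea'] -= 1                   # update the value stored under an existing key
missing = stock.get('juice', 0)     # get() returns a default instead of raising KeyError

for item, count in stock.items():   # iterate over (key, value) pairs
    print(item, count)

del stock['coffee']                 # remove an entry
print(stock)                        # {'tea': 11}
```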

In [None]:
# create a dict mapping the key "Kim" to "milk tea"

In [None]:
# add a new entry mapping "Sandy" to "long black"

In [None]:
# retrieve the value of the key "Kim"

### Assignment and Mutability

Values may be assigned to variables. Primary types are always immutable, while data structures may be mutable or immutable.

* What does mutability mean?
* What is the value of `y` after the following is executed:

  ```python
  >>> x = 5
  >>> y = x
  >>> x += 3
  ```
  
* What is the value of `z` after the following is executed:

  ```python
  >>> w = [1, 2]
  >>> z = w
  >>> w.append(3)
  >>> w = [4, 5, 6]
  ```
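
To reason about questions like these, it helps to separate *rebinding a name* from *mutating an object*. A minimal sketch with different values and types than the exercises above (all names are invented for the illustration):

```python
a = 'tea'
b = a              # both names refer to the same str object
a += ' time'       # str is immutable, so this builds a new string and rebinds a
print(b)           # tea

first = {'milk'}
second = first         # an alias: two names, one set object
snapshot = set(first)  # an independent copy
first.add('sugar')     # mutates the shared object
print(second is first)     # True
print(sorted(second))      # ['milk', 'sugar']
print(sorted(snapshot))    # ['milk']
```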

### Control Structures

Programs usually execute each step sequentially. Control structures alter the flow of execution.

#### if statements

`if` statements execute a nested block of code only if the condition passes; if it doesn't pass, any following `elif` statements are then attempted with other conditions. Finally, if no condition passes and there's an `else` statement, its block is executed.
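
A minimal sketch of that flow, assuming a made-up temperature reading; only the first branch whose condition passes is executed.

```python
temperature = 31   # hypothetical reading in degrees Celsius

if temperature >= 35:
    advice = 'stay indoors'
elif temperature >= 25:       # only tested because the first condition failed
    advice = 'bring water'
else:                         # runs when no condition above passed
    advice = 'bring a jacket'

print(advice)   # bring water
```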

In [None]:
x = int(2/3)
# print 'greater than' if the value of x is greater than 0,
# print 'less than' if x is less than 0
# otherwise print 'equal to'

#### for loops

`for` loops iterate over some sequence of data and execute once for each item.

In [None]:
x = ['one', 'two', 'three']
# print each string in x surrounded by '---' (e.g., '---one---', then '---two---', etc.)

#### while loops

`while` loops execute their code repeatedly for as long as a condition remains true; the condition is checked before each pass.
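
For example, this small sketch (with invented names) counts down and stops once the loop condition becomes false:

```python
countdown = 3
ticks = []

while countdown > 0:          # tested before every pass through the loop
    ticks.append(countdown)
    countdown -= 1            # without this update the loop would never end

print(ticks)       # [3, 2, 1]
print(countdown)   # 0
```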

In [None]:
x = [5, 4, 3, 2, 1]
# while x has more than one number, take the last two elements (hint: use list.pop())
# add the two numbers
# print the result
# and append the result back onto x


### Functions

Functions let you write some code that can be reused by calling it later, possibly with different parameters.
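
As an illustration, here is a hypothetical function `repeat()` with one required parameter and two optional ones; the same code is reused with different arguments.

```python
def repeat(text, times=2, sep=' '):
    """Return *text* repeated *times* times, joined by *sep*."""
    return sep.join([text] * times)

default_call = repeat('ha')                      # uses the default values
custom_call = repeat('ho', times=3, sep='-')     # overrides them by keyword
print(default_call)   # ha ha
print(custom_call)    # ho-ho-ho
```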

In [None]:
# write a function double() that takes a value x and returns x multiplied by 2


In [None]:
assert double(4) == 8
assert double('ha') == 'haha'

### Built-in Functions

You should know what [Python's built-in functions](https://docs.python.org/3/library/functions.html) do, and know how to use them (perhaps using some documentation).

#### `int()`, `float()`, `str()`, `bool()`, `list()`, `tuple()`, `dict()`, and `set()`

These functions cast or create values for Python's basic data types.

In [None]:
int('10')

In [None]:
float(3)

In [None]:
str(3)

In [None]:
bool(3)

In [None]:
list('abc')

In [None]:
tuple('abc')

In [None]:
dict([('a', 1), ('b', 2)])

In [None]:
set('aaabbc')

#### `all()`, `any()`, `min()`, `max()`, `sum()`, `len()`

These functions "reduce" iterables: that is, they take an iterable like a list or set, and return a single value.

In [None]:
all([True, True, False])

In [None]:
any([False, False, True])

In [None]:
min([8, 6, 7, 5, 3, 0, 9])

In [None]:
max([8, 6, 7, 5, 3, 0, 9])

In [None]:
sum([8, 6, 7, 5, 3, 0, 9])

In [None]:
len('abcdefghijklmnopqrstuvwxyz')

#### `map()`, `sorted()`, `reversed()`, `range()`, and `zip()`

These functions take iterables and produce more iterables. `map()` and `sorted()` are **higher-order functions** because they can also take another function as a parameter to determine their output (the first argument of `map()`, the `key` argument of `sorted()`).
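
The sketch below, using an invented word list, shows a function being passed as an argument, and also that `map()` returns a one-shot iterator rather than a list.

```python
words = ['pear', 'fig', 'banana', 'kiwi']   # illustrative data

lengths = list(map(len, words))                 # apply len to every item
by_length = sorted(words, key=len)              # ties keep their original order
by_last = sorted(words, key=lambda w: w[-1])    # sort by final letter

squares = map(lambda n: n * n, range(4))        # an iterator, not a list
first_pass = list(squares)                      # consumes the iterator
second_pass = list(squares)                     # nothing left to consume

print(lengths)      # [4, 3, 6, 4]
print(by_length)    # ['fig', 'pear', 'kiwi', 'banana']
print(by_last)      # ['banana', 'fig', 'kiwi', 'pear']
print(first_pass, second_pass)   # [0, 1, 4, 9] []
```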

In [None]:
list(map(str.lower, ['Pierre', 'Vinken']))

In [None]:
sorted('Chef Charlie chopped chilli peppers while chef Chloe charred chicken.'.split())

In [None]:
sorted('Chef Charlie chopped chilli peppers while chef Chloe charred chicken.'.split(), key=str.lower)

In [None]:
list(reversed(range(10)))

In [None]:
list(zip('abc', range(3)))

#### `open()` and `print()`

These functions work with files and buffers for reading and writing text or bytes.

In [None]:
import tempfile
import os
with tempfile.TemporaryDirectory() as tmpdirname:  # don't worry about tempfile; you don't need to know it
    path = os.path.join(tmpdirname, 'myfile.txt')

    # write a string to *path* (use 'with open...')
            
    # read it back and print
    

### Regular Expressions

Regular expressions are used for matching patterns in text.

In [None]:
import re
print(re.match(r'one', 'three two one'))   # None: match() must match at the start of the string
print(re.search(r'one', 'three two one'))  # search() finds the first match anywhere in the string
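
Beyond testing whether a pattern matches, groups and `re.findall()` let you pull pieces out of the text. A small sketch on a made-up string:

```python
import re

text = 'Order 66 shipped on 2024-05-17 to bay 3'   # made-up example text

# Parentheses create groups that can be retrieved from the match object
date = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
year, month, day = date.groups()

numbers = re.findall(r'\d+', text)        # every non-overlapping match, as strings
at_start = re.match(r'Order', text)       # a match object: the pattern is at the start
not_at_start = re.match(r'shipped', text) # None: the pattern occurs later in the string

print(date.group(0), year, month, day)
print(numbers)
```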

### File I/O, Buffers, and Encodings

Files and responses to web requests are **buffers** that hold a portion of some data at a time. They generally contain bytes, which must be decoded using the appropriate **encoding** to get a Unicode string.
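
The same idea can be tried without a network connection. In this sketch, `io.BytesIO` is an in-memory stand-in for a file or web response (an assumption made for illustration), and the sample text is invented.

```python
import io

text = '한국어'                    # sample text: "Korean (language)"
raw = text.encode('ksc5601')      # str -> bytes; two bytes per syllable in this encoding
buffer = io.BytesIO(raw)          # in-memory stand-in for a file or web response

first_chunk = buffer.read(2)      # a buffer can be read a piece at a time
rest = buffer.read()              # read whatever remains
decoded = (first_chunk + rest).decode('ksc5601')   # bytes -> str
print(decoded == text)            # True

try:
    raw.decode('utf-8')           # decoding with the wrong encoding
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)         # True
```

Decoding with the wrong encoding does not always raise an error; it can also silently produce the wrong characters, which is why the encoding needs to be known rather than guessed.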

In [None]:
from urllib import request
response = request.urlopen('http://andrew.triumf.ca/multilingual/samples/korean.html')
# response is a buffer, use its read() method to store the data in a variable named raw
raw = response.read()

In [None]:
# decode the bytes into a unicode string using the 'ksc5601' encoding
s = raw.decode('ksc5601')
print(s)

## NLP and the NLTK

You are expected to know some basic tasks in the field of natural language processing and how to accomplish those tasks using the NLTK (with documentation, perhaps).

In [None]:
import nltk  # run this

### Working with Corpora

We've discussed two main kinds of corpora: **lexical resources** and **texts**.

* What are the differences between lexical resources and texts?
* Which of the following are lexical resources?
  - Wordnet
  - Newspaper archives
  - Spoken dialog
  - Bilingual dictionaries
  - Lists of stopwords

In [None]:
# import the nltk.corpus.stopwords corpus
# get the list of 'indonesian' stopwords

In [None]:
# count how many sentences are in Herman Melville's novel Moby Dick (from the nltk.corpus.gutenberg corpus)


### Frequency Distributions

Frequency distributions are used to count things. Conditional frequency distributions count things separately for different contexts.
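
The idea can be sketched with the standard library alone. Here `collections.Counter` and a `defaultdict` of counters stand in for `nltk.FreqDist` and `nltk.ConditionalFreqDist`; the data and the genre labels are invented for the illustration.

```python
from collections import Counter, defaultdict

# Plain frequency distribution: count each item
tokens = 'a rose is a rose is a rose'.split()
counts = Counter(tokens)
print(counts['rose'])            # 3
print(counts.most_common(1))     # [('a', 3)]

# Conditional frequency distribution: one counter per condition
observations = [('news', 'the'), ('news', 'market'),
                ('fiction', 'the'), ('fiction', 'dragon'), ('fiction', 'the')]
by_genre = defaultdict(Counter)
for genre, word in observations:
    by_genre[genre][word] += 1

print(by_genre['fiction']['the'])   # 2
print(by_genre['news']['the'])      # 1
```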

In [None]:
# create a frequency distribution of the downcased words in Moby Dick, excluding English stopwords


### Tokenization

Tokenization breaks up a string into smaller strings. As this is often one of the first steps in **NLP pipelines** (sequential analysis tasks), the way it is done can have a large effect on downstream performance.
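
To see why the choice of tokenizer matters, compare two naive strategies on an invented sentence. Neither is what NLTK's tokenizers do; they are only stand-ins to show how the output can differ.

```python
import re

sentence = "Dr. Lee didn't pay $3.50, did she?"   # made-up example

# Strategy 1: split on whitespace (punctuation stays glued to words)
by_whitespace = sentence.split()

# Strategy 2: runs of word characters, or single punctuation marks
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace)
print(by_regex)
print(len(by_whitespace), len(by_regex))   # 7 15
```

The first strategy leaves `'$3.50,'` and `'she?'` as single tokens, while the second breaks `"didn't"` and `'3.50'` into pieces; a downstream step such as counting or tagging would see quite different inputs in each case.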

In [None]:
# tokenize the sentence: "Ms. Tan showed the flat to prospective renters from Taiwan, Japan, and Indonesia."



### Stemming and Lemmatization

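Stemming strips affixes with heuristic rules, so the result may not be a real word (e.g., "ponies" may become "poni"), while lemmatization maps an inflected word to its dictionary base form (its lemma). Both are kinds of lexical normalization used to reduce data sparsity.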

In [None]:
# use nltk.stem.WordNetLemmatizer and nltk.stem.porter.PorterStemmer to lemmatize and stem some inflected words
import nltk.stem
import nltk.stem.porter


### Wordnets

Wordnets are highly structured lexical resources that describe the relationships between words.

In [None]:
from nltk.corpus import wordnet as wn
# find the common ancestor synset for "dog" and "cat"

### N-grams and Sequence Models

N-grams are small slices taken from sequential data. E.g., a "bigram" is a 2-tuple, a "trigram" is a 3-tuple, and a "unigram" is a 1-tuple. They are used to provide generalized models of the order of items in a sequence.
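
A minimal pure-Python sketch of the idea, using a hypothetical helper `ngrams()` and an invented token list. NLTK provides `nltk.bigrams()` and `nltk.ngrams()` for the same purpose; this version only illustrates the sliding window.

```python
from collections import Counter

def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    # zip() stops at the shortest slice, so only complete windows are produced
    return list(zip(*(items[i:] for i in range(n))))

tokens = 'to be or not to be'.split()   # illustrative data

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(len(trigrams))                             # 4

# Counting n-grams gives a simple model of which sequences are common
print(Counter(bigrams).most_common(1))           # [(('to', 'be'), 2)]
```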

In [None]:
# Produce a list of the bigrams in the sentence: "Kim likes milk tea."

In [None]:
# Create a conditional frequency distribution of bigrams for the sentences "the dog chased the cat" and "the dog slept".


In [None]:
# what is the most likely word to follow "the"?


### Part-of-speech Tagging

Words in a text can be annotated with their categories, called **parts of speech**, to provide a basic syntactic description.

In [None]:
# Use `nltk.pos_tag()` to tag the sentence "But Sandy enjoys coffee." (hint: you need to tokenize it first)


### Supervised Classification

Supervised classification is one kind of **machine learning**. It is **supervised** because each data instance is annotated with a task-specific label, and the machine has to learn the associations between properties of the instance and the label.

In [None]:
from nltk.corpus import movie_reviews  # load the movie_reviews corpus
movie_reviews.categories()

In [None]:
# create a variable 'data' that contains the labeled data
# hint: for each category, get the fileids for the category,
#       then associate all words in each file to its category
data = []


In [None]:
# get the size of 1/5 of the data
k = 5

In [None]:
# shuffle, then split the data into test and train portions
import random  # use random.shuffle()

In [None]:
# define a feature extractor function


In [None]:
# use the function to produce lists of (feature_dict, label) pairs for train and test
# (name them train_feats and test_feats; test_feats is used in a later cell)


In [None]:
# train a naive bayes classifier with the training data


In [None]:
# evaluate the accuracy on the test data


In [None]:
# is that accuracy any good? We need to know the distribution of the test data

pos = len([label for _, label in test_feats if label=='pos'])
print(pos / len(test_feats))

An accuracy of 68% is better than the 52% baseline we would get by labeling every test item "neg". Note that these numbers may fluctuate depending on what `random.shuffle()` does above.
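
The comparison against a majority-class baseline can be sketched on toy data. The labels and predictions below are invented for illustration and are not results from the movie review corpus.

```python
from collections import Counter

# Toy gold labels and classifier predictions (made up for this sketch)
gold = ['neg', 'pos', 'neg', 'neg', 'pos']
predicted = ['neg', 'pos', 'neg', 'neg', 'neg']

# Baseline: always guess the most frequent label in the test data
majority_label, majority_count = Counter(gold).most_common(1)[0]
baseline = majority_count / len(gold)

# Accuracy: the fraction of predictions that agree with the gold labels
correct = sum(p == g for p, g in zip(predicted, gold))
accuracy = correct / len(gold)

print(majority_label, baseline)   # neg 0.6
print(accuracy)                   # 0.8
```

A classifier is only doing useful work if its accuracy is clearly above this baseline.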