# Week 6

This week is about getting data into Python from external sources, such as files on your computer or online. When working with these kinds of sources, we need to understand **character encodings** and **streams**. Additionally, this week we will cover **string formatting**, as it is useful when writing to files or to the terminal.

In [74]:
import nltk  # make sure NLTK is installed and loaded

## Unicode

Every character displayed by your computer is assigned a number. Historically, each character set (e.g., one for a particular language) assigned its own numbers to characters, which made it difficult to mix character sets in one document. [Unicode](https://unicode.org/) is the modern standard for assigning these numbers: one giant table covering all known characters, including some non-language characters (🥳🦥🌤...). Each number in this table is called a *codepoint*. In Python, strings are "pure" sequences of codepoints. You can find the codepoint (as an integer) of a character with Python's `ord()` function, and the character for a codepoint with the `chr()` function:

In [75]:
ord('Z')

90

In [76]:
chr(129445)

'🦥'

In practice these two functions are used rarely, however.

## Encodings

Whenever a Unicode string needs to be stored or transmitted outside of Python, it must be encoded into a sequence of bytes.

In [77]:
'あ'.encode('utf-8')

b'\xe3\x81\x82'

Similarly, bytes can be decoded to strings:

In [78]:
b'\xe3\x81\x82'.decode('utf-8')

'あ'
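As a small illustration of the difference between a string and its encoded bytes, the sketch below encodes the same one-character string with two encodings. The variable names are only for illustration.

```python
text = 'あ'                        # one codepoint, U+3042

utf8 = text.encode('utf-8')        # 3 bytes in UTF-8
utf16 = text.encode('utf-16-be')   # 2 bytes in big-endian UTF-16

print(len(text), len(utf8), len(utf16))   # 1 3 2
print(utf8, utf16)                        # b'\xe3\x81\x82' b'0B'

# Decoding with the matching encoding recovers the original string
assert utf8.decode('utf-8') == text
assert utf16.decode('utf-16-be') == text
```

The string has length 1 in both cases; only the byte representation changes with the encoding.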

Notice that `bytes` objects (the literals prefixed with `b`) use escape sequences to represent bytes, such as `\xe3`, which represents the bits `1110 0011` (note: you do not need to know this conversion). Python also accepts escape sequences in regular strings, but there the number is not a byte of UTF-8 or some other encoding; it is the numeric value of the codepoint. You do not need to learn these escapes; just recognize that `\x`, `\u`, or `\U` followed by 2, 4, or 8 hexadecimal digits (`0123456789ABCDEF`), respectively, is a Unicode escape.

In [79]:
'\u3042'

'あ'

Aside: If you want to find out the decimal value of the hexadecimal number, use the `int()` function with a base of 16:

In [80]:
int('3042', 16)

12354

And you can get back the hexadecimal version with `hex()`:

In [81]:
hex(12354)

'0x3042'
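To tie the escapes back to `ord()` and `hex()`, here is a short sketch (not part of the original notebook) showing that each escape names a codepoint, and that this value differs from the UTF-8 bytes of the same character:

```python
# Each escape names a codepoint by its hexadecimal value
assert '\xe9' == 'é'              # \x takes 2 hex digits
assert '\u3042' == 'あ'            # \u takes 4 hex digits
assert '\U0001f9a5' == '🦥'       # \U takes 8 hex digits

# The escape value is the codepoint, not the UTF-8 bytes
codepoint = ord('あ')
encoded = 'あ'.encode('utf-8')
print(hex(codepoint))   # 0x3042
print(encoded)          # b'\xe3\x81\x82'
```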

If you try to encode a character that the target encoding cannot represent, you'll get an error. In this case, the letter 'é' is not part of the `ascii` encoding:

In [82]:
'café'.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)

You can tell Python what to do in case of errors, such as ignoring them (note that the letter doesn't appear in the output):

In [83]:
'café'.encode('ascii', errors='ignore')

b'caf'
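Besides `'ignore'`, Python's standard error handlers include `'replace'`, `'backslashreplace'`, and `'xmlcharrefreplace'`. A brief sketch of what each produces for the same word (the variable names are illustrative):

```python
word = 'café'

replaced = word.encode('ascii', errors='replace')
escaped = word.encode('ascii', errors='backslashreplace')
xml_ref = word.encode('ascii', errors='xmlcharrefreplace')

print(replaced)   # b'caf?'
print(escaped)    # b'caf\\xe9'
print(xml_ref)    # b'caf&#233;'
```

Unlike `'ignore'`, these handlers leave a visible trace of the character that could not be encoded.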

## Streams

When you have a string in Python, you have the entire contents, so you can query its length or access any character immediately. When you're working with *streams*, however, you only get a small slice, or window, at a time. This is useful when the data is too large to fit into memory (like a dump of all of Wikipedia) or is slow to download.
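As a minimal sketch of reading a window at a time, the example below uses `io.BytesIO` as a stand-in for a large file or network response (an assumption made so that it runs offline) and reads at most 8 bytes per step:

```python
import io

# io.BytesIO stands in for a large file or a network response
stream = io.BytesIO(b'abcdefghij' * 3)   # 30 bytes in total

chunk_sizes = []
while True:
    chunk = stream.read(8)    # ask for at most 8 bytes at a time
    if not chunk:             # an empty result means the stream is exhausted
        break
    chunk_sizes.append(len(chunk))

print(chunk_sizes)   # [8, 8, 8, 6]
```

At no point does the loop hold more than one chunk, which is what makes streams practical for very large inputs.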

Here we will download the text of a book from Project Gutenberg (without using NLTK):

In [84]:
import urllib.request

# urlopen() returns a stream, but then we call .read(), which fetches the whole thing.
# The result is a bytes object, not str.
bytestring = urllib.request.urlopen('http://gutenberg.org/files/13083/13083-0.txt').read()


In [85]:
type(bytestring)

bytes

Depending on the language, the data may not be very readable:

In [86]:
bytestring[1004:1046]

b'\xc3\xbastavu pro psychologii a v\xc3\xbdchovu robot\xc5\xaf'

So we need to decode it:

In [87]:
string = bytestring.decode('utf-8')
type(string)

str

Now we can read the string (if we can read Czech). Note that the indices into the bytestring don't always line up with those of the string, because some characters are encoded as more than one byte.

In [88]:
string[974:1013]

'ústavu pro psychologii a výchovu robotů'
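The mismatch between byte indices and character indices can be reproduced on a short string. This sketch uses a fragment of the text above; the variable names are illustrative:

```python
text = 'výchovu robotů'
data = text.encode('utf-8')

print(len(text), len(data))    # 14 16

# Character index of 'robotů' in the string versus byte index in the bytes
char_index = text.index('robotů')
byte_index = data.index('robotů'.encode('utf-8'))
print(char_index, byte_index)  # 8 9

# To convert a character index to a byte index, encode the prefix
assert len(text[:char_index].encode('utf-8')) == byte_index
```

The byte index is one larger because `ý`, which comes before `robotů`, takes two bytes in UTF-8.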

We can also read and write files on disk using `open()`. Let's write the downloaded bytes directly to disk using `open()`'s `wb` ("write bytes") mode:

In [89]:
with open('myfile.txt', 'wb') as f:
    f.write(bytestring)

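Now confirm that we have written the file by reading it back in text mode. The output below begins with `\ufeff`, a byte order mark (BOM) stored at the start of this file; to strip it, open the file with the `utf-8-sig` encoding instead of `utf-8`.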

In [90]:
with open('myfile.txt', encoding='utf-8') as f:
    string = f.read()
string[:100]

'\ufeffThe Project Gutenberg eBook, R.U.R., by Karel Čapek\n\n\nThis eBook is for the use of anyone anywhere '
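The difference between the `utf-8` and `utf-8-sig` encodings can be seen without a file by decoding a few bytes directly. The byte string below is made up for illustration:

```python
raw = b'\xef\xbb\xbfThe Project'       # UTF-8 bytes starting with a BOM

with_bom = raw.decode('utf-8')         # BOM kept as the character U+FEFF
without_bom = raw.decode('utf-8-sig')  # BOM removed

print(repr(with_bom))      # '\ufeffThe Project'
print(repr(without_bom))   # 'The Project'
```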

Instead of writing bytes directly, if you have the decoded string you can write in "text" mode (`wt`, or just `w`). In this case, it's best to specify your desired encoding explicitly. Also note that instead of `f.write(string)`, you can use `print(string, file=f)`, which also appends a newline.

In [91]:
with open('myfile2.txt', 'wt', encoding='utf-8') as f:
    print(string, file=f)
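To see how `f.write()` and `print(..., file=f)` differ, here is a small sketch that writes the same string both ways and reads the result back. The temporary directory and the file name `demo.txt` are assumptions for the example, chosen so that it does not touch your own files:

```python
import os
import tempfile

text = 'robotů'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name

    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)          # writes exactly the string
        print(text, file=f)    # writes the string plus a newline

    with open(path, encoding='utf-8') as f:
        contents = f.read()

print(repr(contents))   # 'robotůrobotů\n'
```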

## String Formatting

When printing to the terminal or writing to disk, sometimes it helps to format the strings so they are more legible or so they follow a particular file format. The `str.format()` method or "f-strings" are two common ways to do so (see this week's reading for explanation of these).
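As a quick sketch of the alignment options used for tables, the format specification after the colon gives an alignment character (`>`, `<`, or `^`) and a field width. The brackets below only make the padding visible:

```python
word, count = 'robot', 42

right = f'[{count:>6}]'            # '>' right-aligns in a field of width 6
left = f'[{word:<8}]'              # '<' left-aligns
center = '[{:^7}]'.format(word)    # '^' centers; same spec with str.format()

# The width itself can come from a variable by nesting braces
width = 5
nested = f'[{count:>{width}}]'

print(right)    # [    42]
print(left)     # [robot   ]
print(center)   # [ robot ]
print(nested)   # [   42]
```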

Write some code that takes a string and prints a table of each letter found in the string with its frequency. The frequency should be right aligned so number columns (ones, tens, etc.) line up. Don't use NLTK's `nltk.FreqDist.tabulate()` method, but you may use `nltk.FreqDist` to get the frequency information. You may choose to filter out non-letter characters.

In [92]:
# recall we can get the frequency distribution of a sequence (of words, or characters, etc.) with nltk.FreqDist
import nltk
# specify the encoding explicitly so the file is decoded the same way on every platform
with open('myfile.txt', encoding='utf-8') as f:
    # `f.read()` returns the full string of the file
    # `if c.isalpha()` only keeps alphabetic characters (optional)
    fd = nltk.FreqDist(c for c in f.read() if c.isalpha())

In [93]:
fd

FreqDist({'e': 10013, 'o': 7979, 'a': 6453, 'n': 6291, 't': 6016, 'l': 4945, 'i': 4759, 's': 4357, 'r': 3717, 'm': 3482, ...})

For our table, we could use a fixed width for the count column, but here I first find the largest frequency and then the number of characters it takes up as a string. This is the widest number that we will display. (This step is optional.)

In [94]:
maximum = max(fd.values())
width = len(str(maximum))

Next we go over each letter in most-common-first order, then print the letter, a tab character (`\t`), then the count right aligned in a span using the width we just calculated.

In [95]:
for c, count in fd.most_common():
    # here I use f-string formatting. The same could be done with:
    # print('{c}\t{count:>{width}}'.format(c=c, count=count, width=width))
    print(f'{c}\t{count:>{width}}')

e	10013
o	 7979
a	 6453
n	 6291
t	 6016
l	 4945
i	 4759
s	 4357
r	 3717
m	 3482
d	 3119
u	 3050
v	 2840
k	 2324
h	 2198
c	 2187
y	 2076
j	 2027
p	 2000
í	 1930
á	 1922
b	 1677
ě	 1270
z	 1095
ž	  941
H	  894
š	  774
ř	  717
D	  699
č	  693
P	  610
A	  544
é	  520
R	  481
N	  474
ý	  433
f	  366
G	  362
T	  352
g	  352
J	  299
ů	  287
B	  263
w	  255
S	  236
q	  210
V	  201
C	  194
F	  192
M	  173
O	  163
U	  149
K	  130
E	  128
ď	  109
I	  108
Z	  107
L	   90
ť	   67
ň	   57
ú	   42
ó	   35
x	   33
Ř	   30
Y	   29
Č	   24
Ž	   24
W	   16
Ó	   14
Š	    9
Ú	    5
Í	    5
Ě	    3
X	    2
Ť	    1
É	    1
Q	    1
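If NLTK is not available, roughly the same table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative rather than the approach used above, and `letter_table` is a hypothetical helper name:

```python
from collections import Counter

def letter_table(text):
    """Return table lines of 'letter<TAB>count', most common first."""
    counts = Counter(c for c in text if c.isalpha())
    if not counts:
        return []   # no letters, so no table
    # width of the largest count, so the number columns line up
    width = len(str(max(counts.values())))
    return [f'{c}\t{n:>{width}}' for c, n in counts.most_common()]

lines = letter_table('aaaaaaaaaa bb c')
for line in lines:
    print(line)
```

For the sample string, this prints `a` with a count of 10, followed by `b` and `c` with their counts padded to the same width.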
