# Week 12

In this notebook we'll look at n-gram language models with NLTK and at semantic analysis with PyDelphin.

Follow along at: https://www.nltk.org/api/nltk.lm.html#module-nltk.lm

# N-gram language models

## Prepare Data

Get the sentences, add padding, and extract the n-grams. The cells below do this for the `news` category of the Brown corpus.

In [6]:
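import nltk
from nltk.corpus import brown
# Requires the Brown corpus; if it is missing, run nltk.download('brown') once
news = brown.sents(categories='news')
# Number of sentences in the news category
len(news)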

4623

In [18]:
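from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams
# Pad the first sentence with <s> and </s> so bigrams also cover its edges
padded = list(pad_both_ends(news[0], n=2))
# All n-grams up to length 2 (unigrams and bigrams)
list(everygrams(padded, max_len=2))
# The same unigrams and bigrams, built separately:
#list(nltk.ngrams(padded, n=1)) + list(nltk.bigrams(padded))

# Bigrams only; the pad symbol is None unless left_pad_symbol/right_pad_symbol are given:
#list(nltk.bigrams(news[0], pad_left=True, pad_right=True))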

[('<s>',),
 ('The',),
 ('Fulton',),
 ('County',),
 ('Grand',),
 ('Jury',),
 ('said',),
 ('Friday',),
 ('an',),
 ('investigation',),
 ('of',),
 ("Atlanta's",),
 ('recent',),
 ('primary',),
 ('election',),
 ('produced',),
 ('``',),
 ('no',),
 ('evidence',),
 ("''",),
 ('that',),
 ('any',),
 ('irregularities',),
 ('took',),
 ('place',),
 ('.',),
 ('</s>',),
 ('<s>', 'The'),
 ('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's"),
 ("Atlanta's", 'recent'),
 ('recent', 'primary'),
 ('primary', 'election'),
 ('election', 'produced'),
 ('produced', '``'),
 ('``', 'no'),
 ('no', 'evidence'),
 ('evidence', "''"),
 ("''", 'that'),
 ('that', 'any'),
 ('any', 'irregularities'),
 ('irregularities', 'took'),
 ('took', 'place'),
 ('place', '.'),
 ('.', '</s>')]
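
To make the padding and n-gram extraction above concrete, here is a minimal pure-Python sketch of what the two steps compute. The function names `pad_both_ends_sketch` and `everygrams_sketch` and the toy sentence are made up for illustration; they are not NLTK functions, and NLTK's own `everygrams` may return the n-grams in a different order depending on the version.

```python
def pad_both_ends_sketch(tokens, n, left="<s>", right="</s>"):
    """Add n-1 boundary symbols to each side of a token list."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)


def everygrams_sketch(tokens, max_len):
    """Return all n-grams of length 1..max_len, shortest first."""
    grams = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(tuple(tokens[start:start + size]))
    return grams


# Hypothetical toy sentence, not taken from Brown
toy_sentence = ["The", "dog", "barked", "."]
toy_padded = pad_both_ends_sketch(toy_sentence, n=2)
toy_grams = everygrams_sketch(toy_padded, max_len=2)

print(toy_padded)
print(toy_grams)
```

A padded sentence of 6 tokens yields 6 unigrams and 5 bigrams, which is the same pattern as the longer output above.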

In [29]:
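from nltk.lm.preprocessing import padded_everygram_pipeline
# Both return values are lazy iterators and can be consumed only once,
# so re-run this cell before fitting another model
train, vocab = padded_everygram_pipeline(2, news)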


In [33]:
from nltk.lm import MLE
# Bigram model with unsmoothed maximum likelihood estimates
lm = MLE(2)
lm.fit(train, vocab)

In [35]:
len(lm.vocab)

14397

In [40]:
lm.vocab.lookup(['Fulton', 'fulton'])

('Fulton', '<UNK>')

In [45]:
lm.counts['The']

806

In [46]:
lm.counts[['The']]['Fulton']

1
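
The counts above are what a maximum likelihood model turns into probabilities: the estimate for a word given its context is the bigram count divided by the total count of bigrams that start with that context. With the numbers shown, that would put P(Fulton | The) at roughly 1/806 ≈ 0.0012 (in NLTK the corresponding call should be `lm.score('Fulton', ['The'])`). The sketch below reproduces the idea in plain Python on a made-up toy corpus; `toy_sents` and `mle_score` are illustrative names, not part of NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from Brown
toy_sents = [
    ["the", "dog", "barked"],
    ["the", "cat", "slept"],
    ["a", "dog", "slept"],
]

# bigram_counts[prev][word] = how often `word` follows `prev`
bigram_counts = defaultdict(Counter)
for sent in toy_sents:
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1


def mle_score(word, prev):
    """Unsmoothed estimate: count(prev, word) / count(prev, anything)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0


print(mle_score("dog", "the"))   # "the" is followed by "dog" once out of twice
print(mle_score("cat", "a"))     # unseen bigram, so the estimate is zero
```

Because there is no smoothing, any bigram that never occurred in training gets probability zero.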

## Train

Train ("fit") the model on a larger body of text from the NLTK corpora (e.g., all or most of Brown or Reuters). The model above was fit only on the `news` category of Brown.

## Use the model to score sentences

How well does the model score sentences in the same domain? How about other domains, such as a Gutenberg book?

In [50]:
lm.entropy(nltk.bigrams(news[0]))

4.762358770834322

In [51]:
lm.perplexity(nltk.bigrams(news[0]))

27.140187277828197
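
The two numbers above are related: entropy here is the negative average log2 probability of the n-grams, and perplexity is 2 raised to that entropy. The following is a small sketch of that relationship in plain Python, assuming a list of per-n-gram probabilities as input; the helper names and `toy_probs` are made up for illustration. It also shows why an unsmoothed model gives infinite perplexity as soon as one n-gram has probability zero, which is worth keeping in mind when scoring text from another domain.

```python
import math


def cross_entropy(probabilities):
    """Negative mean log2 probability; infinite if any probability is zero."""
    if any(p == 0 for p in probabilities):
        return math.inf
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


def perplexity(probabilities):
    """Perplexity is 2 raised to the cross-entropy."""
    h = cross_entropy(probabilities)
    return math.inf if math.isinf(h) else 2 ** h


# Hypothetical per-bigram probabilities for a three-bigram sequence
toy_probs = [0.5, 0.25, 0.125]
toy_entropy = cross_entropy(toy_probs)
toy_perplexity = perplexity(toy_probs)
print(toy_entropy, toy_perplexity)

# The entropy reported in the notebook, converted to perplexity by hand
notebook_entropy = 4.762358770834322
notebook_perplexity = 2 ** notebook_entropy
print(notebook_perplexity)

# One unseen n-gram (probability zero) makes the whole score infinite
print(perplexity([0.5, 0.0]))
```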

## Use the model to predict next words
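
A bigram model predicts the next word from the counts of words that followed the current word in training: either pick the most frequent successor or sample one in proportion to its count. NLTK models offer this through `lm.generate(num_words, text_seed=..., random_seed=...)`. The sketch below is a plain-Python illustration of the same idea on a made-up toy corpus; `successors`, `most_likely_next`, `sample_next`, and `generate_sentence` are hypothetical names, not NLTK functions.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from Brown
toy_sents = [
    ["the", "dog", "barked"],
    ["the", "dog", "slept"],
    ["the", "cat", "slept"],
]

# successors[prev][word] = how often `word` follows `prev`
successors = defaultdict(Counter)
for sent in toy_sents:
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1


def most_likely_next(prev):
    """Return the most frequent successor of `prev`."""
    return successors[prev].most_common(1)[0][0]


def sample_next(prev, rng):
    """Sample a successor of `prev` in proportion to its count."""
    words = list(successors[prev])
    weights = [successors[prev][w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate_sentence(rng, max_words=10):
    """Sample words starting from <s> until </s> or max_words is reached."""
    word, output = "<s>", []
    for _ in range(max_words):
        word = sample_next(word, rng)
        if word == "</s>":
            break
        output.append(word)
    return output


rng = random.Random(0)  # fixed seed so repeated runs sample the same words
print(most_likely_next("the"))
generated = generate_sentence(rng)
print(generated)
```

Greedy selection always returns the same continuation, while sampling can produce any sentence the counts allow.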

# PyDelphin

Install PyDelphin like this: `pip install pydelphin[web]`

Now use it to parse a sentence using the English Resource Grammar (through `delphin.web.client`). Documentation is available at: https://pydelphin.readthedocs.io

In [52]:
from delphin.web import client

In [59]:
response = client.parse('The dog chased the cat.', params={'mrs': 'json'})

In [61]:
m = response.result(0).mrs()

In [63]:
m.rels

[<EP object (h4:_the_q(ARG0 x6, RSTR h7, BODY h5)) at 139753161129024>,
 <EP object (h8:_dog_n_1(ARG0 x6)) at 139753160424416>,
 <EP object (h2:_chase_v_1(ARG0 e3, ARG1 x6, ARG2 x9)) at 139753160424512>,
 <EP object (h10:_the_q(ARG0 x9, RSTR h12, BODY h11)) at 139753160424608>,
 <EP object (h13:_cat_n_1(ARG0 x9)) at 139753160424704>]

In [64]:
from delphin import dmrs
d = dmrs.from_mrs(m)
d

<DMRS object (_the_q _dog_n_1 _chase_v_1 _the_q _cat_n_1) at 139753150859488>

In [66]:
from delphin.codecs import simplemrs, simpledmrs
print(simplemrs.encode(m, indent=True))

[ TOP: h1
  INDEX: e3 [ e SF: prop TENSE: past MOOD: indicative PROG: - PERF: - ]
  RELS: < [ _the_q<0:3> LBL: h4 ARG0: x6 [ x PERS: 3 NUM: sg IND: + ] RSTR: h7 BODY: h5 ]
          [ _dog_n_1<4:7> LBL: h8 ARG0: x6 ]
          [ _chase_v_1<8:14> LBL: h2 ARG0: e3 ARG1: x6 ARG2: x9 [ x PERS: 3 NUM: sg IND: + ] ]
          [ _the_q<15:18> LBL: h10 ARG0: x9 RSTR: h12 BODY: h11 ]
          [ _cat_n_1<19:23> LBL: h13 ARG0: x9 ] >
  HCONS: < h1 qeq h2 h7 qeq h8 h12 qeq h13 > ]


In [67]:
print(simpledmrs.encode(d, indent=2))

dmrs {
  [top=10002 index=10002]
  10000 [_the_q<0:3>];
  10001 [_dog_n_1<4:7> x PERS=3 NUM=sg IND=+];
  10002 [_chase_v_1<8:14> e SF=prop TENSE=past MOOD=indicative PROG=- PERF=-];
  10003 [_the_q<15:18>];
  10004 [_cat_n_1<19:23> x PERS=3 NUM=sg IND=+];
  10000:RSTR/H -> 10001;
  10002:ARG1/NEQ -> 10001;
  10002:ARG2/NEQ -> 10004;
  10003:RSTR/H -> 10004;
}
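
To see how the DMRS links above relate to the MRS, the following sketch rebuilds them by hand from a simplified copy of the MRS printed earlier. It is a rough illustration of the idea only, not PyDelphin's conversion algorithm: the dictionaries are hand-copied from the output, the names `eps`, `qeq`, and `links` are made up, and cases such as shared labels between several predications or undirected EQ links are ignored.

```python
# Hand-copied, simplified view of the MRS above (plain dicts, not PyDelphin objects)
eps = [
    {"id": 10000, "pred": "_the_q", "label": "h4",
     "args": {"ARG0": "x6", "RSTR": "h7", "BODY": "h5"}},
    {"id": 10001, "pred": "_dog_n_1", "label": "h8",
     "args": {"ARG0": "x6"}},
    {"id": 10002, "pred": "_chase_v_1", "label": "h2",
     "args": {"ARG0": "e3", "ARG1": "x6", "ARG2": "x9"}},
    {"id": 10003, "pred": "_the_q", "label": "h10",
     "args": {"ARG0": "x9", "RSTR": "h12", "BODY": "h11"}},
    {"id": 10004, "pred": "_cat_n_1", "label": "h13",
     "args": {"ARG0": "x9"}},
]
# HCONS: each hole is qeq to a label
qeq = {"h1": "h2", "h7": "h8", "h12": "h13"}

by_id = {ep["id"]: ep for ep in eps}
label_to_id = {ep["label"]: ep["id"] for ep in eps}
# A variable "belongs" to the non-quantifier predication that has it as ARG0
# (quantifiers, recognised here by their RSTR argument, only bind it)
intrinsic = {ep["args"]["ARG0"]: ep["id"]
             for ep in eps if "RSTR" not in ep["args"]}

links = []
for ep in eps:
    for role, value in ep["args"].items():
        if role == "ARG0":
            continue
        if value in intrinsic:
            # Variable argument: link to the node that introduces the variable
            target = intrinsic[value]
            post = "EQ" if by_id[target]["label"] == ep["label"] else "NEQ"
            links.append((ep["id"], role, post, target))
        elif value in qeq and qeq[value] in label_to_id:
            # Handle argument: follow the qeq constraint to the labelled node
            links.append((ep["id"], role, "H", label_to_id[qeq[value]]))

top = label_to_id[qeq["h1"]]
print("top:", top)
for start, role, post, end in links:
    print(f"{start}:{role}/{post} -> {end};")
```

The `BODY` holes have no qeq constraint, so they produce no link, which matches the four links in the DMRS output.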
