Updates
- I’ve updated the gold data to remove the part of the SemCor corpus that only annotated verbs. Please get the new files below.
- I have now uploaded the plain-text sentences (detokenized) for the train and test sets. Use these as inputs to your system, e.g., sents.txt. Download links are below.
- The evaluation script is uploaded. Download link is below.
- (2020-11-16) I added more information on the Report format
- (2020-11-16) I expanded on producing annotations and file redirection
- (2020-11-17) I clarified in the appendix that students are not expected to do anything for that section
- (2020-11-17) I explained why you would need to use the code in Looking Up Synsets From Synset-IDs
Task
You will improve upon a basic, automatic WordNet sense tagger. You can create a basic tagger using the NLTK as follows:
import nltk
from nltk.corpus import wordnet as wn

def sense_tag(words):
    """Pair each word with the first WordNet synset for that word (or None)."""
    pairs = []
    for word in words:
        synsets = wn.synsets(word)
        if synsets:
            pairs.append((word, synsets[0]))
        else:
            pairs.append((word, None))
    return pairs
This only pairs words with the first synset WordNet has for the word, or None
if there are no synsets. You and your group will start with this basic, baseline model, and produce different versions with various improvements. You will then estimate the quality of your tagger against human-annotated gold data.
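For example, running the baseline in the interpreter might give something like this (the angstrom.n.01 sense for a is exactly the kind of unlikely annotation you will later try to avoid; the exact output depends on your installed WordNet data):

>>> sense_tag(['a', 'Friday', 'the'])
[('a', Synset('angstrom.n.01')), ('Friday', Synset('friday.n.01')), ('the', None)]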
Directions for Students
The sense-tag.py
script in your project repository is already written and nothing needs to be changed unless you want to specify additional arguments. There are two modules in the hg2051project
package:
hg2051project/
├── __init__.py
├── tagger.py -- contains code for sense tagging
└── utils.py -- contains code for miscellaneous tasks (e.g., tokenization)
You are to modify the tagger.py
with any improvements to the sense tagging. Functions may be added or modified in utils.py
to improve the preprocessing or add postprocessing.
You are expected to implement the following improvements:
- adjust tokenization and text normalization to better capture the individual words
- use part-of-speech tagging to narrow the results (note that you will need to map your part-of-speech tagger’s tags to WordNet’s coarse set n, v, a, r; a sketch of one such mapping follows this list)
- avoid annotating unlikely senses on common words (like angstrom.n.01 for a)
- use sense-tag frequencies (from a gold corpus; see below) to choose the most frequent sense
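Here is a minimal sketch of such a mapping, assuming you use nltk.pos_tag (which produces Penn Treebank tags and requires the averaged_perceptron_tagger data to be downloaded). The names penn_to_wn and sense_tag_with_pos are just illustrative, not part of the provided code:

import nltk
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    """Map a Penn Treebank tag to one of WordNet's coarse tags (or None)."""
    if tag.startswith('N'):
        return wn.NOUN   # 'n'
    elif tag.startswith('V'):
        return wn.VERB   # 'v'
    elif tag.startswith('J'):
        return wn.ADJ    # 'a'
    elif tag.startswith('R'):
        return wn.ADV    # 'r'
    return None

def sense_tag_with_pos(words):
    pairs = []
    for word, tag in nltk.pos_tag(words):
        pos = penn_to_wn(tag)
        # restrict the lookup to the mapped part of speech when we have one
        synsets = wn.synsets(word, pos=pos) if pos else wn.synsets(word)
        pairs.append((word, synsets[0] if synsets else None))
    return pairs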
There are a few related tasks necessary to complete the assignment:
- evaluate your implementation against gold data
- write a report detailing your improvements and team member contributions
I also want to see you use some software-engineering skills to develop the project. For example:
- track your requirements with requirements.txt or similar
- write clean, readable code
- write and run unit tests
- use GitHub issues and/or project boards to track which member is working on what
Note: for the most-frequent-sense, evaluation, and report tasks, details will be provided at a later time.
Gold Data
The gold data come from the SemCor corpus. You should use the training data for “eyeballing” (as we do below) and for training your system. You may also want to split off some development data to help with tuning your system. You should not evaluate with the gold data until you’re near completion.
- Training data: semcor-train.txt (UTF-8, 4.2M) semcor-train-sents.txt (UTF-8, 1.9M)
- Testing data: semcor-test.txt (UTF-8, 474K) semcor-test-sents.txt (UTF-8, 218K)
- License: semcor-license.txt
The corpus has both part-of-speech and WordNet sense annotations. The annotations, however, are on chunks and not words; that is, multi-word expressions get a single tag. For example, let’s look at the first example of the training data:
The
Fulton County Grand Jury NE
said state.v.01
Friday friday.n.01
an
investigation probe.n.01
of
Atlanta atlanta.n.01
's
recent late.s.03
primary election primary.n.01
produced produce.v.04
``
no
evidence evidence.n.01
''
that
any
irregularities abnormality.n.04
took place happen.v.01
.
You’ll note that some words, like The, of, or ’s, do not have any annotation. Named entities, such as Fulton County Grand Jury, only get NE as the tag. NE is not a WordNet sense; SemCor gave a WordNet sense for a generic kind of named entity, such as group.n.01, but I discarded those because named entities are a difficult problem and I think we can just ignore them for this project. Some named entities, however, have a sense, such as Atlanta, because those senses exist in WordNet. Also note that some individual words (like irregularities) get a single sense, and some multi-word expressions (like took place) also get a single sense. If you look later in the file, you may also see that some sentences do not have annotation on words that clearly could be given a sense. That is, the annotation is incomplete. (Update: I have removed this part of the annotation; it turns out that only verbs were annotated for half the corpus.)
Reading the Gold Data
I have provided the load()
function in the evaluate.py
script (see below) which loads the SemCor files, and you are free to use it. If you have already written something that works but want to use my implementation in the end, keep your version around with a comment (I will consider the effort put into your version when assigning partial credit). The load()
function works like this:
>>> sentences = load('semcor-train.txt')
>>> first_sent = sentences[0]
>>> for words, label in first_sent[:3]:
... print(words, f'label={label!r}')
...
('The',) label=None
('Fulton', 'County', 'Grand', 'Jury') label='NE'
('said',) label='state.v.01'
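If you pursue the most-frequent-sense improvement, one way to use the loaded training data is to count how often each word form appears with each sense label. A minimal sketch, assuming the format shown above (the helper name sense_counts is just illustrative; it skips untagged chunks and NE labels):

from collections import Counter

def sense_counts(sentences):
    """Count how often each word form is annotated with each sense label."""
    counts = {}
    for sentence in sentences:
        for words, label in sentence:
            # skip untagged chunks and named-entity-only labels
            if label is None or label == 'NE':
                continue
            form = ' '.join(words).lower()
            counts.setdefault(form, Counter())[label] += 1
    return counts

# e.g., counts = sense_counts(load('semcor-train.txt'))
#       counts['said'].most_common(1)  -> the most frequent label seen for "said"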
Looking Up Synsets From Synset-IDs
The sense-tag.py
script expects sense_tag()
in tagger.py
to return a list of (word, synset) pairs where the synset is either a Synset
object from the NLTK’s WordNet module or None
, meaning no-synset. You therefore have a need to lookup synsets from the synset identifiers in the gold data (so sense_tag()
can return the proper objects). You should not try to split the word form and part-of-speech out of the synset identifier (e.g., to get "friday"
and "n"
from "friday.n.01"
) and then look it up with wn.synsets(...)
because it’s not guaranteed to get the same synset. Instead use wn.synset(...)
with the identifier as its only argument. However, there are some challenges when doing this. When you read the labels from the gold data, you might get one of the following:
- A valid synset ID, e.g., friday.n.01
- None (i.e., no label)
- A non-synset label, e.g., NE
- An obsolete or invalid synset ID, e.g., called.s.00
If you try to call wn.synset(label)
with anything except a valid ID, you’ll get an error, and each case gives a different kind of error. Since we did not cover exception handling in class, I will give you the following code, which looks up valid IDs and ignores the others:
try:
    synset = wn.synset(label)
except (AttributeError, ValueError, nltk.corpus.reader.wordnet.WordNetError):
    synset = None
You can put this in your function somewhere (and import nltk
) and it can help if you need to get the NLTK’s Synset
objects for these labels.
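For example, you could wrap it in a small helper function (the name lookup_synset is just a suggestion, not something sense-tag.py expects):

import nltk
from nltk.corpus import wordnet as wn

def lookup_synset(label):
    """Return the NLTK Synset for a gold label, or None for anything else."""
    try:
        return wn.synset(label)
    except (AttributeError, ValueError, nltk.corpus.reader.wordnet.WordNetError):
        return None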
An alternative to the above is to modify sense-tag.py
so it doesn’t expect sense_tag()
to return Synset
objects, but in general I think the above is easier.
Producing Annotations
Once your code is working, you can run sense-tag.py
as a script with a sentence file as its argument. For example:
$ python sense-tag.py semcor-test-sents.txt
If all is good, you should see outputs that are formatted like the gold data printed to the terminal. If you want to write these to a file, you need to redirect them using the >
operator:
$ python sense-tag.py semcor-test-sents.txt > test.sys
This will create or replace the test.sys
file with the output of the sense-tag.py
command. Once the file is created, you can proceed to evaluate it.
Evaluation
An evaluation script is provided here: evaluate.py
You run it at the terminal with two arguments: the gold data and the corresponding system outputs created above (note: the following uses the output of the baseline system, without any modifications):
$ python evaluate.py semcor-test.txt test.sys
[...]
Totals:
Gold tokens : 44007
System tokens : 38435
Skipped tokens: 28566
Gold labels : 21412
System labels : 22772
Matched labels: 3327
Scores:
Precision: 0.15 (3327 / 22772)
Recall : 0.16 (3327 / 21412)
F-score : 0.15 (2PR / (P + R))
Gold and system tokens are the number of individual tokens (considering multi-word expressions) in each file. When the tokenization method does not produce the same tokens as in the gold, unaligned tokens will be skipped; this number is shown as “Skipped tokens” (the default string.split() is really bad! Even the default NLTK tokenizer is much better). Gold and system labels are how many sense annotations there are in each file. The “Matched labels” count is the number of labels that (a) were on aligned tokens and (b) are the same as the gold label. The following scores are then calculated from the label counts:
- precision: proportion of system outputs that are correct
- recall: proportion of gold annotations correctly predicted by the system
- f-score: harmonic mean of precision and recall
For this task, calculating the accuracy (proportion of tokens that are correctly labeled or correctly not labeled) is difficult because the number of tokens is not the same for the gold and system files.
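To make the arithmetic concrete, here is how the scores in the example output above come out of the label counts (this is just a check, not part of evaluate.py):

matched, system_labels, gold_labels = 3327, 22772, 21412

precision = matched / system_labels                      # ~0.15
recall = matched / gold_labels                           # ~0.16
f_score = 2 * precision * recall / (precision + recall)  # ~0.15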
Notes:
- Your grade is not dependent on high scores (I want to see that you’ve done something reasonable for each task)
- You are not expected to get perfect tokenization (using nltk.word_tokenize() gets the skipped tokens down to 995, and that’s fine; see the example after these notes)
- The [...] in the example above is for messages about misaligned sentences or tokens, which I’ve hidden above. You can use these messages for debugging, or just ignore them.
- I ‘flatten’ the multi-word-expressions (MWEs) in evaluation and repeat the label assigned to them. This means you can get a higher f-score if you get the MWEs correct, but you can also get partial credit if you correctly annotate an individual word from the MWE.
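As a quick illustration of the tokenization difference (word_tokenize needs the NLTK punkt data; the exact output may vary slightly across NLTK versions):

>>> import nltk
>>> "Atlanta's recent primary election.".split()
["Atlanta's", 'recent', 'primary', 'election.']
>>> nltk.word_tokenize("Atlanta's recent primary election.")
['Atlanta', "'s", 'recent', 'primary', 'election', '.']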
Report
The final task is to write a report describing your efforts. The report should be ~2 pages (or 6-12 paragraphs). It does not need to be a .pdf or .docx file; instead you can just create a report.md
file (preferred) or even just a ## Report
section of README.md
. If you create a new file, don’t forget to add and commit it to Git.
In your report I would like to see the following:
- Who did what? I.e., what part of the assignment (which functions, or running the experiment, or writing the report) was each member responsible for?
- A description of the problem. That is, what issues with sense tagging do you see in the baseline system?
- What did you do to fix these problems? Beyond just a technical description of the code you wrote, I’d like to see discussion of your motivation. That is, why do you think your changes fixed the problem?
- What did you try that didn’t work? This could be a feature in your system that didn’t help in the end, or perhaps difficulties you had with using the NLTK, Python, etc.
- What would you like to do if you had more time? Short of doing the actual error analysis (which is extra credit), does anything seem like “low-hanging fruit” for further improvement? Or is there a method you’re curious about and would have liked to try?
Extra Credit
There are two options (you can do both) for extra credit. The extra credit contributes towards this project and not your overall course grade (you can get 100% without doing extra credit, but cannot go over 100% with it).
Multi-Word Expression (MWEs)
Sometimes the best sense tags not a single word but multiple words. For instance, “coffee filter” could be tagged either as:
coffee coffee.n.01
filter filter.n.01
Or as:
coffee filter coffee_filter.n.01
In general, the more precise sense is better. For this task, you would need to decide how to group words, how to look them up collectively in WordNet, and how to decide whether a sense for the multi-word expression is better than the senses for the individual words. One way to check WordNet for multi-word entries is sketched below.
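A starting point is that WordNet stores multi-word lemmas with underscores, so you can check whether an adjacent group of words has its own entry. A rough sketch (the helper name mwe_synsets is illustrative only; deciding when to prefer the MWE sense over the single-word senses is up to you):

from nltk.corpus import wordnet as wn

def mwe_synsets(words):
    """Look up an adjacent group of words as a single WordNet entry."""
    # WordNet joins multi-word lemmas with underscores, e.g. 'coffee_filter'
    return wn.synsets('_'.join(words))

# e.g., mwe_synsets(['coffee', 'filter']) should include coffee_filter.n.01,
# while mwe_synsets(['recent', 'primary']) should come back empty.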
Error Analysis
After you have finished implementing your improvements and have evaluated the tagger’s performance, do a more thorough examination of the cases that went wrong. Are there obvious categories of error types? What might have caused these errors? What would need to be done to fix them?
Appendix: Formatting the Gold Data
Note: You are not expected to do anything with the following code!
The gold data comes from the SemCor corpus. I used the code below to create the formatted semcor-train.txt
and semcor-test.txt
files. If you are curious about how it works, you can see the code below. You do not need to run or understand the code for your group project.
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import semcor

train = []
test = []

# SemCor has both POS and WordNet sense tags. Specify WN senses with 'sem'.
# Use enumerate() to keep track of which sentence we're on.
for i, sent in enumerate(semcor.tagged_sents(tag='sem'), 1):
    sent_data = []  # reset list for each sentence
    # tagged_sents() returns a list of chunks
    for chunk in sent:
        # semcor tags chunks of text as tree objects
        if isinstance(chunk, nltk.tree.Tree):
            words = chunk.leaves()  # leaves of the tree are a list of words
            label = chunk.label()   # label could be one of several things
            # ignore any synset for named entities
            if chunk.height() == 3 and chunk[0].label() == 'NE':
                label = 'NE'
            # some other string labels may be synset identifiers (but maybe not)
            elif isinstance(label, str):
                try:
                    label = wn.synset(label).name()
                except (ValueError, nltk.corpus.reader.wordnet.WordNetError):
                    pass
            # most will be Lemma objects; get the synset
            elif isinstance(label, nltk.corpus.reader.wordnet.Lemma):
                label = label.synset().name()
            # ignore anything else
            else:
                label = ''
        # untagged text appears as a list of words
        else:
            words = chunk
            label = ''
        sent_data.append(f'{" ".join(words)}\t{label}')
    # Done with a sentence; put every tenth one into the testing data
    if i % 10 == 0:
        test.append(sent_data)
    else:
        train.append(sent_data)

# print to training file
with open('semcor-train.txt', 'w', encoding='utf-8') as file:
    for lines in train:
        for line in lines:
            print(line, file=file)
        print(file=file)  # print a blank line after each sentence

# print to testing file
with open('semcor-test.txt', 'w', encoding='utf-8') as file:
    for lines in test:
        for line in lines:
            print(line, file=file)
        print(file=file)  # print a blank line after each sentence