Introduction

Your overall goal for the annotation project is to come up with a task that is interesting (and ideally, useful).  This is a loose definition, but some guidelines to help you think through it:
Task
  • Your annotation task must be one that requires your human judgment for the labels.  
    • There are many NLP tasks that do not require human judgments (do not consider these):
      • stock price prediction (will $GOOG go up or down on 2/1/22?) can use historical stock prices as labels. 
      • censorship prediction (will a post on a social media site be censored?) can train a model on posts paired with a label of whether or not that post was deleted by the platform.
      • predicting political party affiliation (whether a press release was written by a Democrat or Republican) can use data of previous press releases by members of Congress paired with labels of their declared party membership.
    • Instead, your labels should be ones that can only be provided by a human reading through the text and exercising their judgment.  You must be the ones to provide those judgments (they shouldn't already exist elsewhere).
  • You should manually be labeling your data, and not using algorithmic processes to do so.  Let's say you're annotating all of the mentions of commercial products in text (e.g., "Tide", "BMW", "Nintendo Switch") and you have a list of 5000 products that you're looking for, so you write some code to read in that list and then automatically tag all mentions of those 5000 products in the data.  This is not interesting!   (There is no need to label data for this in order to train a supervised system; you could simply run your algorithm instead.)  A better solution in this instance would be to not start with a list of pre-defined products, but rather to simply label all commercial products present in a text (where your work in the annotation guidelines would be to carefully define what counts conceptually as a "commercial product").
  • Your labels should not be deterministic, but should really require some human comprehension of the context. Let's say you're annotating how suspenseful a literary passage is, and every time a passage contains the word "thunder" you rate it a 3 for suspense; if it contains "gasped" you rate it a 2; if it contains "anticipated" you rate it a 1; and 0 otherwise.  If you can write an algorithm like this that fully deterministically predicts your gold labels correctly (see the keyword-rule sketch after this list), then the task isn't interesting enough.  (Again, there is no need to label data for this in order to train a supervised system; you could simply run your algorithm instead.)  A better solution in this instance would be to let individuals express their own subjective experience of suspense while reading a passage, but try to calibrate what intensity of suspense counts as a "3" vs. "2" vs. "1" so that their measures can be related to each other.
  • Your labels can be categorical (= classification task) or real-valued (= regression task).
  • Your labels must be scoped over either a single document or over contiguous token spans.  We will provide formats that the data should be annotated in.  Document classification annotations can be carried out as easily as in Excel; see BRAT (http://brat.nlplab.org; https://github.com/nlplab/brat) for a relatively easy-to-install system for span-level annotations.  You can also carry out span-level annotations within software like Excel; just be mindful that you will be submitting those annotations using BIO tagging (cf. https://web.stanford.edu/~jurafsky/slp3/8.pdf, p. 7; a short BIO example appears after this list).
  • If you can't think of a new task, you may use existing NLP tasks (except no sentiment analysis), but try to think outside of the box: projects will be rewarded for their creativity and originality in coming up with a task that few people have considered before.  For examples of creative work, see: how dogmatic is a forum post?; how respectful are police officers in their interactions at traffic stops?; how suspenseful is a passage from a story?; how much time is passing in it?
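
To make the point about deterministic labels concrete, here is a minimal sketch in Python (the keywords and scores are the hypothetical ones from the suspense example above) of the kind of rule that should not be able to reproduce your gold labels. If a function this simple matches your annotations exactly, the task does not require human judgment:

    # Hypothetical keyword rule from the suspense example above: if a rule this
    # simple can reproduce your gold labels exactly, the task is too deterministic.
    SUSPENSE_KEYWORDS = {"thunder": 3, "gasped": 2, "anticipated": 1}

    def rule_based_suspense(passage: str) -> int:
        """Return the highest keyword score found in the passage, or 0 if none match."""
        tokens = passage.lower().split()
        return max((SUSPENSE_KEYWORDS.get(token, 0) for token in tokens), default=0)

    print(rule_based_suspense("Thunder rolled as she gasped."))  # 3
    print(rule_based_suspense("They walked home."))              # 0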
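
For span-level tasks, here is a short illustration of the BIO scheme mentioned above: B-X marks the first token of a span of type X, I-X marks a continuation of that span, and O marks tokens outside any span (the PRODUCT span type here is just a hypothetical example):

    # BIO tags for a single sentence with a hypothetical PRODUCT span type.
    tokens = ["I", "bought", "a", "Nintendo", "Switch", "yesterday"]
    tags   = ["O", "O",      "O", "B-PRODUCT", "I-PRODUCT", "O"]

    for token, tag in zip(tokens, tags):
        print(f"{token}\t{tag}")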

Team

  • Your team must have exactly 3 people in it.  You'll have an option to allow us to randomly assign you to a team.

Data 

  • You can carry out this annotation on data from any natural language.
  • All data you annotate must be shareable with the public, so no private information, and nothing under copyright. Keep privacy and ethics in mind as you are considering potential sources of data.  (For example, don't use Twitter data, since their terms of service prohibit re-publishing tweets; this is an important privacy consideration, ensuring that Twitter users are able to permanently delete their tweets from the platform without having them live on in public elsewhere; to read more on this, see Fiesler and Proferes 2018).

Deliverables

    • For AP0, you will turn in:
      • A list of all members of your team, or a request to randomly assign you to a group.
    • For AP1, you will turn in:
      • A sample of your data.
      • A 300-word description of your task.   This should include:
        • What are the categories you are considering annotating?  These need not be fully finalized yet (AP2 will do that after you start digging into the data) but give us enough information to check if you're on the right track.
        • What is the source of your data?  Include a statement that the data is neither private nor under copyright, and can be freely shared with others (only consider data that meet those criteria).
    • For AP2, you will turn in:
      • your data with 1000 adjudicated labels.
      • your annotation guidelines.
      • your two separate annotations of the "evaluation" batch, which should include roughly 500 labels per annotator.
    • For AP3, you will (individually) receive the guidelines and data from a separate group. You will turn in:
      • Annotations for data containing ~100 labels annotated according to those guidelines.
      • 300-word peer-review of the annotation guidelines you were provided.
      • Note that this is an individual (not group) assignment.
    • For AP4, you will build a classifier to automatically predict the labels using the data you've annotated.  You will turn in:
      • A Jupyter notebook containing your experimental results, along with confidence intervals on your held-out predictive accuracy.
      • A 300-word reflection on the successes and/or challenges of the classification task you have constructed.
    • For AP5, you will document your annotated dataset using a datasheet. You will turn in:
      • A completed datasheet (as a Markdown file).

AP1

For Annotation Project deliverable AP1, your job is to come up with an annotation task and gather the data that you will be annotating. Be sure to read the annotation project intro (above) first, which has several important criteria to follow when proposing your task.

The deliverables are the following:

  • A 300-word description of your task (as a .txt file).   This should include:
    • What are the categories you are considering annotating?  These need not be fully finalized yet (AP2 will do that after you start digging into the data) but give us enough information to check if you're on the right track.
    • What is the source of your data?  Include a statement that the data is neither private nor under copyright, and can be freely shared with others (only consider data that meet those criteria).
  • One of the documents that you will annotate (as a .txt file).  (This of course will not be the full dataset you annotate; we just want to confirm that you have identified a source for data and are able to work with it).  Once you've settled on your annotation task, you should gather the data right away so you can focus on annotating it for AP2.

If you would like to run an idea for an annotation task by the instructors before submitting this assignment, please do so as a post on Piazza (either public or private).

AP2

With AP2, you will be annotating your data (at least 1000 documents) and creating annotation guidelines that encode the criteria you are using, so that others can use them to annotate new documents.  This assignment, the highest weighted of the AP deliverables (45%), is due on March 30, so you'll have a month to carry it out, but keep in mind that you'll have a lot to do in this time! As you carry out this assignment, both annotating data and creating guidelines, think beyond class to the life your project can have afterwards: ideally, you want to create something that you'd be happy to put out in the world (e.g., on Github) and that has an impact both in the NLP community and in the world more broadly -- something you can be proud to be known as the creators of.

Data

Your final labeled data should contain at least 1000 adjudicated documents.  Here's how that process should go.
  • Let's say you have 1000 documents. Separate them into two batches: "exploration" (500 documents) and "evaluation" (500 documents).
  • At the beginning of the project, use the exploration batch to iteratively create your annotation guidelines: all three group members should annotate some documents in this collection, and then compare where you disagree. Make decisions on your consensus for these disagreements and note them in the guidelines.  Iterate on this process (as described in Pustejovsky and Stubbs, ch. 6) until you feel that the guidelines are solid.  Don't consider any documents in the "evaluation" batch in this phase.
  • Once the guidelines are finalized, two annotators should each independently annotate all of the documents in the "evaluation" batch -- importantly, without talking to each other or discussing your annotations at this stage.  Only use the decisions you've codified in the annotation guidelines. We'll calculate the inter-annotator agreement (IAA) using these two independent annotations (see the kappa sketch after this list).  Remember, the IAA rate here gives a sense of the level of agreement that two random annotators in the future would have on this task (which we'll verify in AP3), so it's important that you not collaborate on your annotations at this stage, or your IAA will be artificially inflated.
  • Using those same guidelines, two annotators should each go back and independently annotate all of the documents in the "exploration" batch.  We don't want to use this data for calculating IAA (since you've been collaborating on it already to decide on the final guidelines), but you still want it annotated in order to build out the full size of your dataset for training and evaluating models later.
  • At this point, you have two independent annotations for all of the documents in the "exploration" and "evaluation" sets, and you will naturally have some disagreements.  The third group member should now act as an adjudicator for those disagreements to produce a single, final version of the labeled data.
  • In terms of the distribution of labor, there are several options. If your group contains Luke, Han and Leia, you can either a.) have Luke and Han create the two separate independent annotations for all documents, with Leia acting as adjudicator for all of them; or b.) mix it up by document, as in the following:
    • Document 1, Luke + Leia separate annotations; Han adjudicates any disagreement to produce final version
    • Document 2, Luke + Han separate annotations; Leia adjudicates any disagreement to produce final version
    • ...
    • Document N, Leia + Han separate annotations; Luke adjudicates any disagreement to produce final version
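
As a rough sketch of the inter-annotator agreement calculation mentioned above (an illustration only, assuming a categorical task, scikit-learn, and the tab-separated individual-annotations format described under Deliverables below), Cohen's kappa between the two evaluation-batch annotators could be computed along these lines:

    # Sketch: Cohen's kappa between two annotators on the "evaluation" batch.
    # Assumes option a. above (the same two annotators label every document) and a
    # tab-separated file with columns: data point ID, annotator ID, label, text.
    import csv
    from collections import defaultdict

    from sklearn.metrics import cohen_kappa_score

    labels = defaultdict(dict)  # data point ID -> {annotator ID: label}
    with open("individual_annotations.txt", encoding="utf-8") as f:
        for doc_id, annotator, label, _text in csv.reader(f, delimiter="\t"):
            labels[doc_id][annotator] = label

    annotators = sorted({a for anns in labels.values() for a in anns})
    a1 = [anns[annotators[0]] for anns in labels.values() if len(anns) == 2]
    a2 = [anns[annotators[1]] for anns in labels.values() if len(anns) == 2]

    print(f"Cohen's kappa between {annotators[0]} and {annotators[1]}: "
          f"{cohen_kappa_score(a1, a2):.3f}")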

Guidelines

Your annotation guidelines should be sufficiently detailed that a third party would be able to reproduce your judgments on a set of new documents that you have not seen.  These guidelines should be at least two pages long (and potentially longer, depending on the complexity of your task and the experience you had during the exploration phase making consensus decisions on any disagreements).  Keep in mind two things:
  • For deliverable AP3, another group will be using only your guidelines to make judgments about your task, so be sure it contains enough information for them to do so.
  • In the bigger picture, if you choose to publish your dataset on Github after class, people will look to your annotation guidelines to understand the task and the exact criteria for the different categories, so be sure it is polished.
Here are some sample annotation guidelines so you can see what they look like when created by NLP researchers.  While these of course are very detailed, they should give you a sense of the kind of specificity that such guidelines can contain.

Deliverables

For AP2, you will turn in:
  • Your data with 1000 adjudicated labels.  Name this "adjudicated.txt".  This data should have four tab-separated columns:
    • data point ID (a unique integer)
    • the word "adjudicated"
    • the label
    • the original text (not tokenized)
  • Your annotation guidelines.  Name this "guidelines.pdf"
  • Your individual annotations of the "evaluation" batch, which should include roughly 500 labels per annotator if using option a. above (Luke and Han annotate everything) or 333 labels per annotator if using option b (Luke, Han and Leia alternate being primary annotators per document).  Name this "individual_annotations.txt". This data should have four tab-separated columns:
    • data point ID (a unique identifier)
    • the annotator ID (e.g., "dbamman")
    • the label
    • the original text (not tokenized)

(Note that each data point ID in this file should have 2 rows corresponding to the two separate annotations by different annotator IDs. See AP/sample_individual_annotations.txt in the nlp22 Github repo for an example, and the file-writing sketch after this list.)

  • The output of AP/Data Validator.ipynb (from the nlp22 Github repo) on your complete data (this checks to make sure all of the data files you submit are in the proper format).  After changing the paths to point to your data, be sure all cells execute successfully.
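
If it helps, here is a minimal sketch of producing both files in the four-column format above (the example rows are hypothetical; the Data Validator notebook remains the authority on the exact format):

    # Sketch: writing adjudicated.txt and individual_annotations.txt in the
    # four-column tab-separated format described above. The rows are hypothetical;
    # replace them with your own data.
    import csv

    adjudicated = [
        (1, "adjudicated", "suspenseful", "Thunder rolled as she reached for the door."),
        (2, "adjudicated", "not_suspenseful", "They walked home in the afternoon sun."),
    ]

    individual = [
        (1, "luke", "suspenseful", "Thunder rolled as she reached for the door."),
        (1, "han", "not_suspenseful", "Thunder rolled as she reached for the door."),
    ]

    def write_tsv(path, rows):
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f, delimiter="\t").writerows(rows)

    write_tsv("adjudicated.txt", adjudicated)
    write_tsv("individual_annotations.txt", individual)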

AP3

In this homework, you will be using another group's guidelines to annotate a sample of data.  Consider this primarily as a form of peer feedback: as a person external to the group, you are testing how robust those guidelines are when used by someone who did not develop them.  You have been assigned another group's guidelines and a sample of ~100 data points (see links below); your primary task is to use those guidelines to annotate that data sample, and to provide written feedback for the group.  
Note that this is an *individual* assignment, and you should not discuss your annotations with others in class (and certainly not coordinate with the group whose guidelines you have been given).

Deliverables:

  • The dataset assigned to you, annotated according to the guidelines you have been given.  The dataset has four tab-separated columns, three of which have already been filled in for you; you should only complete the "label" column:
    • Data ID (this is provided for you)
    • annotator (your student ID has been provided here)
    • label (provide your judgment for the text in this row, according to the guidelines)
    • text (this is provided for you)

Be sure to use only the guidelines in shaping your judgments.  Even if the task you have been given is familiar enough that you think you already know how to do it, ground every judgment you make in the guidelines you have been given; those guidelines may define the boundaries of the concept differently from how you would have defined them. Upload this file as "annotations.txt".

  • 300-word (+/- 10%) written feedback for the group (in the form of a text file) on their annotation guidelines.  Consider what would happen if the group were to release these guidelines to the broader world: how could the group improve them? Are the category boundaries clear? Are there cases where the guidelines allowed too much ambiguity?  Do the examples provided in the guidelines match the complexity of the data you actually had to annotate?  Do the guidelines contain enough detail to be sufficient in guiding your annotations? Upload this file as "comments.txt".
You can access the guidelines and data that have been assigned to you at the following URLs:
  • Guidelines: ***
  • Data: ***

AP4

For the final annotation project deliverable, you have two tasks: a.) build a predictive model for the data you've annotated and b.) analyze its performance.

We've randomly divided the adjudicated data you created into training (60%), development (20%) and test (20%) splits, and have implemented two simple models: a majority class classifier (which takes the most frequent class in the training data and predicts that label for every data point in the evaluation set) and a simple L2-regularized logistic regression model with a bag-of-words featurization.  You can find those results for your model, along with all others in the class, here: ***
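
For reference, a bag-of-words logistic regression baseline of this kind might look roughly like the sketch below. This is an illustration only (not our actual implementation), and it assumes the splits are stored in the same four-column tab-separated format as adjudicated.txt, with hypothetical file names:

    # Sketch of a bag-of-words, L2-regularized logistic regression baseline.
    # Assumes train/dev files in the same four-column tab-separated format as
    # adjudicated.txt; adjust the loading code to match the actual splits.
    import csv

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    def load_split(path):
        labels, texts = [], []
        with open(path, encoding="utf-8") as f:
            for _doc_id, _source, label, text in csv.reader(f, delimiter="\t"):
                labels.append(label)
                texts.append(text)
        return labels, texts

    train_y, train_texts = load_split("train.txt")
    dev_y, dev_texts = load_split("dev.txt")

    vectorizer = CountVectorizer()
    train_X = vectorizer.fit_transform(train_texts)
    dev_X = vectorizer.transform(dev_texts)

    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(train_X, train_y)
    print("Dev accuracy:", model.score(dev_X, dev_y))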

You should use the data in this directory for this assignment (with its corresponding train/dev/test splits), and not any previous version of this data you might have had.

Build a predictive model

For part a.) of this assignment, try to improve on those baselines using all of the knowledge you have accumulated in annotating this data, along with your knowledge about NLP that you've gathered in this class.  This can include incorporating structure that you know is important (e.g., document structure, syntax), bringing in external resources (e.g., dictionaries), or using more sophisticated models (e.g., BERT).  Feel free to draw on any (or none!) of the following notebooks in the nlp22 AP/ directory:

When you report the accuracy of your classifier, be sure to also report the 95% confidence intervals around that number (see the Logistic regression notebook for code to do so).  
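
One common approach is a normal-approximation binomial interval around test accuracy; the sketch below illustrates that approach (an assumption for illustration, not necessarily the method used in the course notebook):

    # Sketch: 95% normal-approximation confidence interval around test accuracy,
    # where correct is the number of correct test predictions out of n.
    import math

    def accuracy_confidence_interval(correct, n, z=1.96):
        acc = correct / n
        half_width = z * math.sqrt(acc * (1 - acc) / n)
        return acc, acc - half_width, acc + half_width

    acc, lower, upper = accuracy_confidence_interval(correct=172, n=200)
    print(f"accuracy = {acc:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")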

Analysis

For part b.), use the model you've trained on your data to tell us something about the phenomenon you've annotated.  Potential ideas for this could include:
  • Does your model learn features of the phenomenon that you didn't consider in your guidelines that might cause you to rethink the category boundaries? (See Long and So 2016 for an example.)
  • What labels are often mistaken for each other? (E.g., using a confusion matrix.)
  • Which features does the model learn to most strongly associate with each class?  (E.g., see Table 2 in Zhou and Jurgens 2020; a starting-point sketch for this and the previous bullet appears after this list.)
  • What kind of systematic mistakes does your model make? This could involve reading through test predictions and manually categorizing mistakes that are made (see Manning 2011 section 3 for an example).
  • Are there any biases your model makes? (E.g., by performing worse on different dialects or registers of the language -- see Blodgett et al. 2016 for an example.)
  • Think about the level of balance in your dataset: Is one label extremely prevalent? How could this impact the model you developed? Is your dataset a good candidate for strategies like oversampling or changing class weights?
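
As a starting point for the confusion-matrix and top-features bullets above, the following sketch assumes a fitted scikit-learn LogisticRegression (model) and CountVectorizer (vectorizer) as in the earlier baseline sketch, along with a featurized dev split (dev_X, dev_y); it is an illustration, not a complete analysis:

    # Sketch: confusion matrix over dev predictions and the highest-weight
    # bag-of-words features per class. Assumes model, vectorizer, dev_X and dev_y
    # from the earlier baseline sketch.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    predictions = model.predict(dev_X)
    print(confusion_matrix(dev_y, predictions, labels=model.classes_))

    feature_names = np.array(vectorizer.get_feature_names_out())
    for class_index, class_label in enumerate(model.classes_):
        # For binary tasks coef_ has one row; positive weights favor classes_[1].
        weights = model.coef_[0] if len(model.classes_) == 2 else model.coef_[class_index]
        if len(model.classes_) == 2 and class_index == 0:
            weights = -weights
        top_features = feature_names[np.argsort(weights)[-10:][::-1]]
        print(class_label, list(top_features))
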
Feel free to analyze this with any model (not necessarily your best-performing one) -- e.g., you may use a linear model like logistic regression for this part.  This kind of analysis serves multiple ends -- it helps communicate what others should know if they were to use your model in practice, and it can also shed light on aspects of the underlying concept that you hadn't considered while annotating.  You will be graded in this part on the depth of your analysis (where, e.g., simply printing a confusion matrix or the top logistic regression coefficients is less in-depth than a full error analysis or a synthesis of multiple analyses).
Deliverables:
  • A Jupyter notebook containing the model you have trained for part a.), along with a report of its performance on the test data (with confidence intervals) and all of the analysis you carried out for part b.).  You may submit multiple notebooks; for each notebook, also submit a PDF of that notebook with all of its cells executed.

AP5

(Note: this was a final, reflective assignment, but I'd likely move this up to the front in future years.)

"Datasheets for Datasets" (Gebru et al. 2021) is an effort to document the important choices that go into the creation of datasets in order to help inform the consumers of those datasets -- e.g., the people who will be using them to train models and build systems that might be deployed in the real world.  The questions in a datasheet help focus attention on several important issues -- why was the dataset created? What's in it? How was it collected? Who labeled it?  The questions in a datasheet should be considered before the data was collected (in order to help inform that decisions that you make in response to your answers), but you can also use them here as an opportunity to reflect back on your process of building your annotation project -- now that you have carried out this project, what would you have done differently if building something similar in the future?

Your job is to read the "Datasheets for Datasets" paper and answer the questions posed in the datasheet about your annotation project data: https://github.com/JRMeyer/markdown-datasheet-for-datasets/blob/master/DATASHEET.md

To do so, download that datasheet Markdown file and answer those questions within that document; refer to the original paper if you have questions about them. Your sole deliverable is that completed markdown file.  Do your best to answer all of the questions, but if any question does not apply to your task, feel free to answer "Not applicable".