Info 159/259. Natural Language Processing

Info

This course introduces students to natural language processing and exposes them to the variety of methods available for reasoning about text in computational systems. NLP is deeply interdisciplinary, drawing on both linguistics and computer science, and helps drive much contemporary work in text analysis (as used in computational social science, the digital humanities, and computational journalism). We will focus on major algorithms used in NLP for various applications (part-of-speech tagging, parsing, coreference resolution, machine translation) and on the linguistic phenomena those algorithms attempt to model. Students will implement algorithms and create linguistically annotated data on which those algorithms depend.

Staff

David Bamman (dbamman@berkeley.edu), OH: Wed 10am-11am (314 South Hall); 11am-noon (Zoom; Queue).

TAs (info159259-instructors@lists.berkeley.edu):

Kent Chang (kentkchang@berkeley.edu)

Jingshu Rui (jingshu_rui@berkeley.edu)

Tim Schott (timschott@berkeley.edu)

Aryia Dattamajumdar (aryia.datta@berkeley.edu)

Rachel McCarty (rachelmccarty@berkeley.edu)

Divya Tadimeti (dtadimeti@berkeley.edu)

Nancy Xu (yinuoxu54@berkeley.edu)

TA Office Hours

Monday: 2-3pm, South Hall 210 (Tim/Divya)

Tuesday: 12-1pm, South Hall 107 (Kent)

Wednesday: 11am-noon, Zoom (Nancy; link on bCourses)

Thursday: 10am-11am, South Hall 205 (Rachel)

Friday: 12-1pm, Zoom (Jingshu/Aryia; link on bCourses)

Texts

[SLP3] Dan Jurafsky and James Martin, Speech and Language Processing (3rd ed. draft) [Available here]
[G] Yoav Goldberg, Neural Network Methods for Natural Language Processing (2017) [Available for free on campus/VPN here]
[PS] James Pustejovsky and Amber Stubbs, Natural Language Annotation for Machine Learning (2012) [Online access available for free through the UC library here]

Syllabus

(Subject to change.)

Week	Date	Topic	Readings	Assignments
1	1/17	Introduction [slides]	Browse http://nlpprogress.com
1	1/19	Lexical semantics/static word embeddings [slides]	SLP3 ch 6
2	1/24	Text classification 1: Logistic regression [slides]	SLP3 ch 5
2	1/26	Text classification 2: MLP and CNN [slides]	SLP3 ch 7.1-7.4; 7.6; G ch. 13
3	1/31	Text classification 3: Attention and transformers [slides]	SLP3 ch 10
3	2/2	Annotation [slides]	PS ch. 6
4	2/7	Language modeling 1 [slides]	SLP3 ch 3; SLP3 ch 7.5; 7.7; SLP3 ch 9.2
4	2/9	Language modeling 2: Contextual embeddings [slides]	SLP3 ch 10
5	2/14	Language modeling 3: Prompting methods and reinforcement learning from human feedback (RLHF) [slides]	Liu et al. 2021; Ouyang et al. 2022
5	2/16	Sequence labeling: POS tagging; HMM [slides]	SLP3 ch 8
6	2/21	Neural sequence labeling [slides]	SLP3 ch 9.3-9.6
6	2/23	Midterm 1 (in class)
7	2/28	Syntax [slides]	SLP3 ch 17; SLP3 ch 18
7	3/2	Context-free parsing algorithms [slides]	SLP3 ch 17
8	3/7	Dependency parsing algorithms [slides]	SLP3 ch 18
8	3/9	Semantic role labeling [slides]	SLP3 ch 24
9	3/14	Wordnet, supersenses and WSD [slides]	SLP3 ch 23
9	3/16	Coreference resolution [slides]	SLP3 ch 26
10	3/21	Information extraction [slides]	SLP3 ch 21
10	3/23	Midterm 2 (in class; cumulative)
11	3/28	Spring Break
11	3/30	Spring Break
12	4/4	Question answering [slides]	SLP3 ch 14
12	4/6	Machine translation [slides]	SLP3 ch 13
13	4/11	Text generation [slides]	SLP3 ch 15
13	4/13	Social NLP [slides]	Nguyen et al. 2020
14	4/18	Commonsense reasoning (Kent)	Storks et al. 2020
14	4/20	Latent variable models [slides]	Kim et al. 2019
15	4/25	Multimodal NLP [slides]	Bisk et al. 2020
15	4/27	Final project presentations

Prerequisites

— Algorithms: Computer Science 61B
— Probability/Statistics: Computer Science 70, Math 55, Statistics 134, Statistics 140 or Data 100
— Strong programming skills

Grading

Info 159

25%	Homeworks (HW)
25%	Annotation project (AP)
10%	Weekly quizzes
20%	Midterm exams: max(midterm 1, midterm 2)
20%	NLP subfield survey

Info 259

20%	Homeworks (HW)
20%	Annotation project (AP)
10%	Weekly quizzes
20%	Midterm exam: max(midterm 1, midterm 2)
30%	Project:
	5% Proposal/literature review
	5% Midterm report
	15% Final report
	5% Presentation
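
As a worked illustration of the weights above, the following sketch computes an overall score from component scores on a 0-100 scale. The function names, dictionary keys, and input format are hypothetical; the Info 259 project is treated as a single 30% component, and a midterm that was not taken is passed as None. It is an illustration of the arithmetic, not an official grade calculator.

```python
# Illustrative only: names and input format are assumptions, not an official tool.
WEIGHTS = {
    "info159": {"homework": 0.25, "annotation": 0.25, "quizzes": 0.10,
                "midterm": 0.20, "survey": 0.20},
    "info259": {"homework": 0.20, "annotation": 0.20, "quizzes": 0.10,
                "midterm": 0.20, "project": 0.30},
}


def midterm_component(midterm1, midterm2):
    """Return max(midterm 1, midterm 2); a midterm not taken is passed as None."""
    taken = [s for s in (midterm1, midterm2) if s is not None]
    return max(taken) if taken else 0.0


def course_score(course, scores, midterm1=None, midterm2=None):
    """Weighted sum of component scores, each on a 0-100 scale."""
    weights = WEIGHTS[course]
    components = dict(scores, midterm=midterm_component(midterm1, midterm2))
    return sum(weights[name] * components[name] for name in weights)


# Made-up scores for a hypothetical Info 159 student.
example = {"homework": 90, "annotation": 85, "quizzes": 100, "survey": 80}
print(round(course_score("info159", example, midterm1=70, midterm2=88), 2))
```

Because only the higher midterm counts, the score of 70 on midterm 1 in this made-up example has no effect on the total.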

Lectures will be recorded through course capture and made available through bCourses; attendance at lectures is not required but highly encouraged. Weekly quizzes will test your knowledge of that week's lectures and readings, so be sure to watch the lecture to stay on track.

Annotation project

The most exciting applications of NLP haven't been invented yet. While much of this course will give you exposure to the common methods in NLP, you will also carry out an annotation project where you will get exposure to the entire NLP design process for building a classifier for a brand new task. You will decide on a new document classification NLP task, annotate data to support it (including creating annotation guidelines), measure your inter-annotator agreement rate, and build a classifier to predict those labels using the methods we discuss in class.

You may use existing NLP tasks (except sentiment analysis), but try to think outside of the box: projects will be rewarded for their creativity and originality in coming up with a task that few people have considered before, and for being able to create a comprehensive set of guidelines that lead to consistent third-party annotations (i.e., not by your team). To give you a sample of similarly creative new NLP tasks, consider the following work: how dogmatic is a forum post?; how respectful are police officers in their interactions at traffic stops?; how suspenseful is a passage from a story?; how much time is passing in it?

Deliverables

AP0. Form a project group of exactly 3 people and let us know who's in the group. Either select your group yourself or let us pair you randomly with other teammates.

AP1. Decide on a document classification annotation task. Any natural language is OK. No sentiment analysis! Collect data and tokenize it. All data must be shareable with the public, so no private information and nothing under copyright. Keep privacy and ethics in mind as you are considering potential sources of data.

AP2. Annotate the data, creating at least 500 labeled examples and a robust set of annotation guidelines that govern the decisions you make. All of the data must be manually annotated by each member of your group. Report your inter-annotator agreement rates. In a separate assignment, a different group will annotate the same data using only your annotation guidelines (and calculate their own IAA), so make the guidelines comprehensive!

AP4. Build a classifier to automatically predict the labels using the data you've annotated.

NLP subfield survey (Info 159)

Understanding how to read and synthesize articles in NLP is an important part of carrying out research in this space. To cultivate this skill, your final report will be a 2000-word survey of a specific NLP subfield of your choice (e.g., coreference resolution, question answering, interpretability, narrative generation), synthesizing at least 25 papers published at ACL, EMNLP, NAACL, EACL, AACL, Transactions of the ACL or Computational Linguistics. This survey should give a newcomer (such as yourself at the start of the semester) a sense of the current state of the art in that subfield in 2023, the major historical papers that have defined that area, and the different schools of thought within it. The survey should use the ACL 2023 style files for formatting, which are available as an Overleaf template.

Project (Info 259)

Info 259 will be capped by a semester-long project (carried out by one to three students) involving natural language processing -- either focusing on core NLP methods or using NLP in support of an empirical research question. For examples of the former, see papers published at ACL, NAACL and EMNLP; for examples of the latter, see workshops for NLP and Computational Social Science, Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Natural Language Processing Techniques for Educational Applications, Noisy User-Generated Text, and many more.

The project will comprise four components:

— Project proposal and literature review. Students will propose the research question to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
— Midterm report. By the middle of the course, students should present initial experimental results and establish a validation strategy to be performed at the end of experimentation. (4 pages; 10 sources)
— Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication—including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (6 pages, not including references)
— Presentation. At the end of the semester, teams will present their work in a poster session.

All reports should use the ACL 2023 style files, which are available as an Overleaf template and source templates for LaTeX and Microsoft Word.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. You may discuss homeworks at a high level with your classmates (if you do, include their names on the submission), but each homework deliverable must be completed independently -- all writing and code must be your own. All quizzes and exams must be completed on your own. If you mention the work of others, you must be clear in citing the appropriate source. (For additional information on plagiarism, see here and this great infographic by Emily Myers.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. All homeworks and project deliverables are due at the time and date of the deadline. We have a zero-tolerance policy for cheating and plagiarism; violations will be referred to the Center for Student Conduct and will likely result in failing the class.

AI assistance

If you use the output of automatic writing assistants (e.g., ChatGPT) or code suggestions (e.g., Copilot), you must cite that source and be clear what text or code came from it (or was inspired by it). Note that this comes with three caveats. (1) You retain responsibility for anything you submit (whether text or code); you should be prepared to demonstrate your understanding of it and should have performed due diligence to check its correctness (as we will discuss, such language models are prone to fabricating claims about the world!). (2) You should be honest about your use of these methods; this is a large class, and if you submit the same automatically generated code/text as another student without citing appropriately, we will treat it as plagiarism. (3) The work you submit should still be largely your own, representing your own ideas, code and words; submissions that rely too heavily on AI tools will not be graded very favorably. (This policy itself is inspired by that of Chris Potts.)

Ed Discussion

We'll use Ed Discussion as a platform for asking and answering questions about the course material, including homeworks. Students are encouraged to actively participate on this forum and help others by answering questions that arise (helpful students can see a grade bump across a threshold, e.g., B+ to A-, for this participation). When helping with homework questions, keep the discussion to the high-level concepts; don't post answers to homeworks or quiz/exam questions.

TA office hours

While at TA office hours, keep academic integrity in mind: you may discuss homework questions at a high level with others present, but don't discuss specific answers or share screens with code solutions. Neither the TA office hours nor Ed Discussion should be used for pre-grading (asking if a specific answer to a homework or quiz question is correct before the assignment is due).

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.

Late assignments

Students will have a total of three late days to use when turning in homework assignments and quizzes (not group annotation project deliverables or 259 project deliverables); each late day extends the deadline by 24 hours. If all late days have been used up, homeworks/quizzes can be turned in up to 48 hours late for 50% credit; anything submitted more than 48 hours late receives no credit. Each homework and quiz will be due at 11:59pm, and will have a 2-hour grace period for any last-minute submission issues. Late days and incompletes will be assessed immediately following the grace period (at 2:00am sharp). The grace period applies to late days as well (if a homework is due at 11:59pm 1/21, and you use a late day to extend it to 11:59pm 1/22, you may turn it in up to 2:00am 1/23 and still be assessed 1 late day). Late days are assessed immediately once homeworks or quizzes are submitted late and can't be retroactively changed (if you submit 2 homeworks and 2 quizzes late, for example, you can't decide after the fact which ones to apply your 3 late days to -- they apply to whichever homeworks or quizzes use them up first).
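
To make the late-day arithmetic concrete, here is a rough sketch of how a single submission might be assessed. It is an interpretation, not the official procedure: the function and constant names are hypothetical, and the policy above does not spell out what happens when the remaining late days cover only part of the delay. The sketch assumes the remaining days are spent first and the 48-hour half-credit window is then measured from the extended deadline, without an additional grace period.

```python
import math

GRACE_HOURS = 2
HOURS_PER_LATE_DAY = 24
HALF_CREDIT_WINDOW_HOURS = 48


def assess_submission(hours_late, late_days_left):
    """Return (credit_fraction, late_days_used) for one homework or quiz.

    `hours_late` is measured from the original 11:59pm deadline.
    """
    if hours_late <= GRACE_HOURS:
        return 1.0, 0  # on time, or inside the grace period

    # Late days needed so that the extended deadline plus grace covers the delay.
    days_needed = math.ceil((hours_late - GRACE_HOURS) / HOURS_PER_LATE_DAY)
    if days_needed <= late_days_left:
        return 1.0, days_needed

    # Assumption: remaining late days are spent first, then the 48-hour
    # half-credit window is measured from the extended deadline (no extra grace).
    hours_past_extension = hours_late - late_days_left * HOURS_PER_LATE_DAY
    if hours_past_extension <= HALF_CREDIT_WINDOW_HOURS:
        return 0.5, late_days_left
    return 0.0, late_days_left


# The example from the policy: due 11:59pm 1/21, submitted about 26 hours later.
print(assess_submission(26, late_days_left=3))
```

Calling the function once per late submission, in submission order, and subtracting the returned late days from the running balance mirrors the rule that late days go to whichever homeworks or quizzes use them up first.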

Curving

Grades for this course will not be curved. Minimum thresholds for letter grades are the following: 93 A, 90 A-, 87 B+, 83 B, 80 B-, 77 C+, 73 C, 70 C-, 67 D+, 63 D, 60 D-, 0 F. Students taking the course P/NP must complete all deliverables and will receive a P if their grade is greater than or equal to 70 (C-); students taking S/U must complete all deliverables and will receive an S if their grade is greater than or equal to 80 (B-).
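
The thresholds above amount to a simple lookup, sketched below. The function names are hypothetical, and the sketch compares the unrounded score against each minimum, since the policy does not say whether scores are rounded first.

```python
# Minimum thresholds from the syllabus, highest first.
THRESHOLDS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"),
    (60, "D-"), (0, "F"),
]


def letter_grade(score):
    """Return the first letter whose minimum the score meets (no rounding)."""
    for minimum, letter in THRESHOLDS:
        if score >= minimum:
            return letter
    raise ValueError("score must be non-negative")


def passes(score, basis, all_deliverables_done=True):
    """`basis` is "P/NP" (cutoff 70, C-) or "S/U" (cutoff 80, B-)."""
    cutoff = {"P/NP": 70, "S/U": 80}[basis]
    return all_deliverables_done and score >= cutoff


print(letter_grade(87.35), passes(75, "P/NP"), passes(75, "S/U"))
```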

Exams

This course has two midterm exams scheduled for 2/23 and 3/23 (completed during class time, but not in person) and no final exam. We will not be offering alternative midterm exam dates, so if you anticipate conflicts with these dates, you should not register for this course. Your midterm exam grade for the course will be the max of midterm 1 and midterm 2 (you will drop the lower-scoring midterm); if you are happy with your score for midterm 1, you do not have to take midterm 2. If you miss midterm 1 for any reason, your course midterm grade will be your score for midterm 2.