Info

This course introduces students to natural language processing and exposes them to the variety of methods available for reasoning about text in computational systems. NLP is deeply interdisciplinary, drawing on both linguistics and computer science, and helps drive much contemporary work in text analysis (as used in computational social science, the digital humanities, and computational journalism). We will focus on major algorithms used in NLP for various applications (part-of-speech tagging, parsing, coreference resolution, machine translation) and on the linguistic phenomena those algorithms attempt to model. Students will implement algorithms and create linguistically annotated data on which those algorithms depend.

Texts

  • [SLP3] Dan Jurafsky and James Martin, Speech and Language Processing (3rd ed. draft) [Available here]
  • [E] Jacob Eisenstein, Natural Language Processing (2018) [Available on bCourses]
  • [G] Yoav Goldberg, Neural Network Methods for Natural Language Processing (2017) [Available for free on campus/VPN here]
  • [PS] James Pustejovsky and Amber Stubbs, Natural Language Annotation for Machine Learning (2012) [Available for free on campus/VPN here]

Syllabus

(Subject to change.)

Week | Date  | Topic                                                                    | Readings
1    | 8/23  | Introduction [slides]                                                    | SLP2 ch 1
2    | 8/28  | Text classification 1 [slides]                                           | SLP3 ch 4
     | 8/30  | Text classification 2; logistic regression [slides]; HW1 out on bCourses (due 9/6) | SLP3 ch 5; G 4
3    | 9/4   | Text classification 3; MLPs and convolutional neural nets [slides]       | G 13
     | 9/6   | Construction of truth; ethics [slides]                                   | PS ch 6; Hovy and Spruit (2016)
4    | 9/11  | Language modeling 1 [slides]                                             | SLP3 ch 3
     | 9/13  | Language modeling 2; RNNs [slides]                                       | G 14
5    | 9/18  | Vector semantics and word embeddings [slides]                            | SLP3 ch 6
     | 9/20  | Embeddings 2 (character + sentence embeddings; embeddings in context) [slides] | Bojanowski et al. (2017); Peters et al. (2018)
6    | 9/25  | Sequence labeling problems: POS tagging; HMMs [slides]                   | SLP3 ch 8
     | 9/27  | MEMMs, CRFs [slides]                                                     | SLP3 ch 8
7    | 10/2  | Neural sequence labeling [slides]                                        | E 7
     | 10/4  | Context-free syntax [slides]                                             | SLP3 ch 10
8    | 10/9  | Context-free parsing algorithms (David Gaddy) [slides]                   | SLP3 ch 11, 12
     | 10/11 | Review [slides]                                                          |
9    | 10/16 | Midterm                                                                  |
     | 10/18 | Dependency syntax [slides]                                               | SLP3 ch 13
10   | 10/23 | Dependency parsing algorithms [slides]                                   | SLP3 ch 13
     | 10/25 | Compositional semantics [slides]                                         | E 12
11   | 10/30 | Semantic parsing [slides]                                                | E 12; SLP3 10.6 (CCG)
     | 11/1  | Semantic role labeling [slides]                                          | SLP3 ch 18
12   | 11/6  | WordNet, supersenses, and WSD [slides]                                   | SLP3 ch 19
     | 11/8  | Coreference resolution [slides]                                          | E 15
13   | 11/13 | Conversational agents; sequence-to-sequence models [slides]              | SLP3 ch 24
     | 11/15 | Information extraction [slides]                                          | E 17
14   | 11/20 | Question answering (class cancelled due to smoke)                        | SLP3 ch 23
     | 11/22 | No class (Thanksgiving)                                                  |
15   | 11/27 | Machine translation; attention [slides]                                  | E 18; G 17
     | 11/29 | Future directions and review [slides]                                    |
RRR  | 12/4  | Final project presentations (202 South Hall)                             |

Prerequisites

  • Algorithms: Computer Science 61B
  • Probability/Statistics: Computer Science 70, Math 55, Statistics 134, Statistics 140, or Data 100
  • Strong programming skills

Grading

Info 159

30%: 7 short homeworks
30%: 4 long homeworks
20%: Midterm exam
20%: Final exam

Info 259

25%: 7 short homeworks
25%: 4 long homeworks
20%: Midterm exam
30%: Project
      5%: Proposal/literature review
      5%: Midterm report
      15%: Final report
      5%: Presentation

One short homework can be dropped (i.e., the "short homework" grade will be calculated from the 6 highest-scoring homeworks turned in). All long homeworks must be turned in.
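To make the drop rule concrete, here is a minimal Python sketch of one way the short-homework component could be computed. The 0-100 scale, equal weighting, and simple averaging of the six highest scores are illustrative assumptions, not the course's official grading formula.

```python
# Illustrative sketch only: assumes each short homework is scored out of 100,
# all homeworks are weighted equally, and the component grade is the mean of
# the six highest scores. These assumptions are not the official formula.

def short_homework_grade(scores: list[float]) -> float:
    """Drop the lowest of the 7 short homework scores and average the rest.

    Homeworks never turned in should be entered as 0.
    """
    if len(scores) != 7:
        raise ValueError("Expected scores for all 7 short homeworks")
    top_six = sorted(scores, reverse=True)[:6]
    return sum(top_six) / len(top_six)

# Example: the weakest homework (55) is dropped from the calculation.
print(short_homework_grade([92, 88, 100, 75, 55, 90, 85]))  # 88.33...
```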

Project

Info 259 will culminate in a semester-long project (involving one to three students) on natural language processing: either focusing on core NLP methods or using NLP in support of an empirical research question. The project consists of four components:

  • Project proposal and literature review. Students will propose the research question to be examined, motivate why it is an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within the related scientific literature. (2 pages; 5 sources)
  • Midterm report. By the middle of the course, students should present initial experimental results and establish a validation strategy to be performed at the end of experimentation. (4 pages; 10 sources)
  • Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to the standards of conference publication: clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
  • Presentation. At the end of the semester, teams will present their work in a poster session.

All reports should use the ACL 2018 style files for either LaTeX or Microsoft Word.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. All homeworks and project deliverables are due at the date and time of the deadline.

Students with Disabilities

Our goal is to make this class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the Disabled Students' Program (DSP), have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or in my office.

Late assignments

Students will have a total of two late days to use when turning in homework assignments (not project deliverables for Info 259); each late day extends the deadline by 24 hours.