Info

This course examines natural language processing as a set of methods for exploring and reasoning about text as data. It focuses on the applied side of NLP: using existing NLP methods and Python libraries in new and creative ways, rather than exploring the core algorithms underlying them (see Info 159/259 for that).

Students will apply and extend existing libraries (including scikit-learn, Keras, Gensim, and spaCy) to textual problems. Topics include text-driven forecasting and prediction (using text for problems involving classification or regression); experimental design; the representation of text, including features derived from linguistic structure (such as parts of speech, named entities, syntax, and coreference) and features derived from low-dimensional representations of words, sentences, and documents; exploring textual similarity for the purpose of clustering; information extraction (extracting relations between entities mentioned in text); and human-in-the-loop interactive NLP. This class will focus both on modern neural methods for these problems (including architectures such as CNNs, RNNs, LSTMs, and attention) and on classical methods (logistic/linear regression, Bayesian models).
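As a small illustration of two of the topics above (representing text and measuring textual similarity), the following is a minimal sketch using only the Python standard library. It is not taken from the course materials: the tokenizer, the function names `bag_of_words` and `cosine_similarity`, and the three example sentences are all assumptions made for illustration. In practice, libraries such as scikit-learn provide more complete versions of both steps.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep runs of ASCII letters, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example documents, invented for this sketch.
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock prices fell sharply on Monday.",
]
vectors = [bag_of_words(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # two similar sentences
sim_02 = cosine_similarity(vectors[0], vectors[2])  # unrelated sentences
print(f"doc0 vs doc1: {sim_01:.3f}")
print(f"doc0 vs doc2: {sim_02:.3f}")
```

The first pair shares most of its vocabulary and scores higher than the second pair, which shares only the word "on". Raw counts give heavy weight to frequent function words, which is one reason weighting schemes and low-dimensional representations are worth studying.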

This is an applied course; each class period will be divided between a short lecture and in-class lab work using Jupyter notebooks (roughly 50% each). Students will be programming extensively during class, and will work in groups with other students and the instructors. Students must prepare for each class and submit preparatory materials before class; attendance in class is required.

This course is targeted to graduate students across a range of disciplines (including information, English, sociology, public policy, journalism, computer science, law, etc.) who are interested in text as data and can program in Python but may not have formal technical backgrounds.

Prerequisites

Graduate student status; proficiency in Python programming (experience writing programs of at least 200 lines of code).

Syllabus

(Subject to change.)

Week | Date | Topic | Readings
1 | 1/22 | Introduction [slides] | NLTK 1
1 | 1/24 | Words [slides] | NLTK 3; Potts
2 | 1/29 | Finding distinctive terms [slides] | Kilgarriff 2001 (up to p. 248); Monroe et al. 2009 (up to 3.3)
2 | 1/31 | Dictionaries [slides] | Stewart and Grimmer (2011) (up to section 5.2)
3 | 2/5 | Text classification [slides] | NLTK 6; Scikit-learn tutorial
3 | 2/7 | Text regression [slides] | Foster et al. 2013; Scikit-learn linear models
4 | 2/12 | Feature generation; experimental design; ethics [slides] | NLTK 6; Dror et al. 2018; Hovy and Spruit 2016
4 | 2/14 | Bootstrap; permutation tests [slides] | A Gentle Introduction to the Bootstrap Method
5 | 2/19 | Lexical semantics/word embeddings [slides] | SLP3 ch. 6; Gensim word2vec tutorial
5 | 2/21 | Word embeddings 2 [slides] | FastText; Gensim FastText blog post
6 | 2/26 | Neural text classification: MLP [slides] | Keras sequential model guide
6 | 2/28 | Neural text classification: CNN [slides] | Keras functional model guide; Understanding CNNs for NLP
7 | 3/5 | (Class cancelled) |
7 | 3/7 | Neural text classification: LSTM [slides] | Understanding LSTMs
8 | 3/12 | Attention [slides] |
8 | 3/14 | Annotating data [slides] | Artstein and Poesio 2008
9 | 3/19 | Wordnet [slides] | SLP3 appendix C; NLTK 2
9 | 3/21 | Wordnet [slides] | SLP3 appendix C; NLTK 2
10 | 3/26 | SPRING BREAK |
10 | 3/28 | SPRING BREAK |
11 | 4/2 | POS tagging [slides] | SLP3 ch. 8; Parrish blog post
11 | 4/4 | Named entity recognition [slides] | SLP3 ch. 17
12 | 4/9 | Sequence labeling [slides] | SLP ch. 9; Blog post
12 | 4/11 | Multiword expressions [slides] | Manning & Schütze (1999); Sag et al. 2001
13 | 4/16 | Dependency parsing [slides] | SLP3 ch. 13
13 | 4/18 | Coreference resolution [slides] | Spacy neural coref
14 | 4/23 | Information extraction 1 | SLP ch. 17
14 | 4/25 | Information extraction 2 | SLP ch. 17
15 | 4/30 | Text clustering | Blog post; Scikit-learn clustering
15 | 5/2 | Text dimensionality reduction |
RRR | 5/7 | Course project presentations 1 |
RRR | 5/9 | Course project presentations 2 |

Grading

10% Participation
40% Homeworks
50% Project:
      5% Proposal/literature review
      15% Midterm report
      25% Final report
      5% Presentation

We will typically have a short homework due before each class (no late homeworks will be accepted); each homework will be graded as {check+, check, check-, 0}.
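The weights above can be checked and combined with a short calculation. This is only a sketch: the syllabus does not say how check+/check/check- marks convert to numbers, so the code assumes each component has already been scored on a 0-100 scale. The names `WEIGHTS`, `PROJECT_PARTS`, and `course_score`, and the example scores, are hypothetical.

```python
# Weights as percentages of the final grade, taken from the breakdown above.
WEIGHTS = {
    "participation": 10,
    "homeworks": 40,
    "proposal": 5,
    "midterm_report": 15,
    "final_report": 25,
    "presentation": 5,
}

# The four components that together make up the 50% project grade.
PROJECT_PARTS = ("proposal", "midterm_report", "final_report", "presentation")


def course_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


project_weight = sum(WEIGHTS[part] for part in PROJECT_PARTS)
total_weight = sum(WEIGHTS.values())

# Hypothetical student: full marks everywhere except 80 on homeworks.
example = {name: 100 for name in WEIGHTS}
example["homeworks"] = 80
example_score = course_score(example)

print(project_weight, total_weight, example_score)
```

Under these assumptions the project components sum to 50 and all weights sum to 100, so a 20-point drop on homeworks lowers the overall score by 8 points.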

Project

Info 256 will be capped by a semester-long project (carried out by one to three students) that uses natural language processing in support of an empirical research question. The project will consist of four components:

  • Project proposal and literature review. Students will propose the research question to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
  • Midterm report. By the middle of the course, students should present initial experimental results and establish a validation strategy to be performed at the end of experimentation. (4 pages; 10 sources)
  • Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication, including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
  • Presentation. At the end of the semester, teams will present their work to the class and broader Berkeley community.
All reports should use the ACL 2018 style files for either LaTeX or Microsoft Word.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. All homeworks and project deliverables are due at the time and date of the deadline.

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss this privately after class or at my office.