Info

This course examines the use of natural language processing as a set of methods for exploring and reasoning about text as data, focusing especially on the applied side of NLP — using existing NLP methods and libraries in Python in new and creative ways (rather than exploring the core algorithms underlying them; see Info 159/259 for that).

Students will apply and extend existing libraries (including scikit-learn, PyTorch, Gensim, spaCy and Hugging Face) to textual problems. Topics include text-driven forecasting and prediction (using text for problems involving classification or regression); exploratory data analysis; experimental design; the representation of text, including features derived from linguistic structure (such as parts of speech, named entities, syntax, and coreference) and features derived from low-dimensional representations of words, sentences and documents; textual similarity; information extraction (extracting relations between entities mentioned in text); and multimodal NLP. This class will focus both on modern neural methods for these problems (including architectures such as convolutional neural networks and transformers) and on classical methods (logistic/linear regression, Bayesian models).

This is an applied course; each class period will be divided between a short lecture and in-class lab work using Jupyter notebooks (roughly 50% each). Students will be programming extensively during class, and will work in groups with other students and the instructors. Students must prepare for each class and submit preparatory materials before class.

This course is targeted to graduate students across a range of disciplines (including information, English, sociology, public policy, journalism, computer science, law, etc.) who are interested in text as data and can program in Python but may not have formal technical backgrounds.

Office hours

David Bamman:
  • Wednesday, 10am-11am (Zoom; queue)
  • Thursday, 10am-11am (314 South Hall)

Shefali Bhatia:
  • Friday, 10:30-11:30am (Zoom)

Prerequisites

    Graduate student status; proficiency in Python programming (programs of at least 200 lines of code), equivalent to INFO 206A/B.

    Syllabus

    (Subject to change.)

    Week | Date | Topic | Readings | Optional
    1 | 8/26 | Introduction [slides] | NLTK 1 | Nguyen et al. 2020
    2 | 8/31 | Words [slides] | NLTK 3; Potts | Manshel 2020; Fischer-Baum et al. 2020
      | 9/2 | Finding distinctive terms [slides] | Kilgarriff 2001 (up to p. 248); Monroe et al. 2009 (up to 3.3) | Jurafsky et al. 2014; Mosteller and Wallace 1964
    3 | 9/7 | Dictionaries [slides] | Stewart and Grimmer (2011) (up to section 5.2) | Lucy et al. 2020; Mendelsohn et al. 2020; Zhou and Jurgens 2020
      | 9/9 | Lexical semantics/word embeddings [slides] | SLP3 ch. 6; Gensim word2vec tutorial | Shechtman 2021; Soni et al. 2021; Kozlowski et al. 2019
    4 | 9/14 | Contextual embeddings [slides] | Smith 2020; Devlin et al. 2018 | Bamman and Burns 2020
      | 9/16 | EDA: Text clustering [slides] | Blog post; Scikit-learn clustering | Nelson 2020; Wilkens 2016
    5 | 9/21 | EDA: Topic models [slides] | Blei 2012 | Klein 2020; Antoniak et al. 2019; Demszky et al. 2019; Grimmer 2010
      | 9/23 | Text classification 1: logistic regression [slides] | NLTK 6; Scikit-learn tutorial | Zhang et al. 2018; Broadwell et al. 2017
    6 | 9/28 | Text regression [slides] | Scikit-learn linear models | Foster et al. 2013
      | 9/30 | Hypothesis testing 1; ethics [slides] | NLTK 6; Dror et al. 2018; Hovy and Spruit 2016 | Field et al. 2021; Blodgett et al. 2020; Denny et al. 2018
    7 | 10/5 | Hypothesis testing 2: bootstrap; permutation tests [slides] | A Gentle Introduction to the Bootstrap Method | Antoniak and Mimno 2017
      | 10/7 | Annotating data [slides] | Krippendorff 2018, "Reliability" (bCourses) | Vidgen et al. 2021; Voigt et al. 2017
    8 | 10/12 | Text classification 2: MLP/CNN [slides] | SLP3 ch. 7; Understanding CNNs for NLP | Iyyer et al. 2015; Zhang and Wallace 2016
      | 10/14 | Text classification 3: Attention/BERT [slides] | Huggingface fine-tuning tutorial | Rogers et al. 2020
    9 | 10/19 | Text classification 4: Few-shot learning and prompting methods [slides] | Liu et al. 2021 | Bender et al. 2021; Brown et al. 2020
      | 10/21 | Interpretability [slides] | Madsen et al. 2021 |
    10 | 10/26 | WordNet [slides] | SLP3 ch. 18; NLTK 2 | Tenen 2018
      | 10/28 | POS tagging [slides] | SLP3 ch. 8; Parrish blog post | Gimpel et al. 2011
    11 | 11/2 | Named entity recognition [slides] | SLP3 ch. 17 | Erlin et al. 2021; Evans and Wilkens 2018
      | 11/4 | Multiword expressions [slides] | Manning & Schütze (1999); Sag et al. 2001 | Handler et al. 2016; Lau et al. 2013
    12 | 11/9 | Dependency parsing [slides] | SLP3 ch. 14 | Reeve 2017; Underwood et al. 2018
      | 11/11 | Veterans Day -- no class | |
    13 | 11/16 | Coreference resolution [slides] | Spacy neural coref | Sims and Bamman 2020
      | 11/18 | Information extraction [slides] | SLP ch. 17 | Keith et al. 2017
    14 | 11/23 | Sequence alignment [slides] | Needleman–Wunsch; Smith–Waterman | So et al. 2019; Wilkerson et al. 2014
      | 11/25 | Thanksgiving -- no class | |
    15 | 11/30 | Domain adaptation [slides] | Ramponi and Plank 2020 | Han and Eisenstein 2019
      | 12/2 | Multimodal NLP [slides] | Bisk et al. 2020 | Pineda and Mebane 2019; Papasarantopoulos et al. 2019
    RRR | 12/7 | Course project presentations 1 | |
      | 12/9 | Course project presentations 2 | |

    Grading

    10% Participation
    40% Homeworks
    50% Project:
          5% Proposal/literature review
          15% Midterm report
          25% Final report
          5% Presentation

    We will typically have a short homework due before each class (so no late homeworks will be accepted); each homework will be graded as {check+, check, check-, 0}. We will drop your 3 lowest-scoring homeworks when calculating your final grade.

    Project

    Info 256 will be capped by a semester-long project (carried out by one to three students) that applies natural language processing in support of an empirical research question. The project will consist of four components:

    • Project proposal and literature review. Students will propose the research question to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
    • Midterm report. By the middle of the course, students should present initial experimental results and establish a validation strategy to be performed at the end of experimentation. (4 pages; 10 sources)
    • Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication, including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (6 pages, not including references)
    • Presentation. At the end of the semester, teams will present their work to the class and broader Berkeley community.
    All reports should use the ACL 2021 style files on Overleaf.

    Policies

    Academic Integrity

    All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. All homeworks and project deliverables are due at the time and date of the deadline. We have a zero-tolerance policy for cheating and plagiarism; violations will be referred to the Center for Student Conduct and will likely result in failing the class.

    Students with Disabilities

    Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.

    COVID-19

    All students are expected to abide by campus policies regarding COVID-19, including masking and vaccination requirements. While this is an in-person class with daily in-person activities, the lecture component will be recorded and made available through bCourses for viewing after class. If you're feeling sick, stay at home and watch the lecture recording instead of coming to class!