Info

This course examines the use of natural language processing as a set of methods for exploring and reasoning about text as data, focusing especially on the applied side of NLP — using existing NLP methods and libraries in Python in new and creative ways (rather than exploring the core algorithms underlying them; see Info 159/259 for that).

Students will apply and extend existing libraries (including pytorch, spacy and huggingface) to textual problems. Topics include text-driven forecasting and prediction (using text for problems involving classification or regression); exploratory data analysis; experimental design; the representation of text, including features derived from linguistic structure (such as named entities, syntax, and coreference) and features derived from low-dimensional representations of words, sentences and documents; exploring textual similarity; and information extraction (extracting relations between entities mentioned in text). We'll focus extensively on the underlying structure and affordances of large language models, especially as they can be used for text-as-data problems.
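To give a small, concrete taste of the "textual similarity" theme above, here is a minimal sketch (plain Python, no external libraries; the course itself uses richer representations and libraries like spacy and huggingface) of representing two texts as bag-of-words count vectors and measuring their cosine similarity:

```python
from collections import Counter
import math

def cosine_similarity(doc1: str, doc2: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    # Tokenize naively on whitespace and count word frequencies.
    a, b = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    # Dot product over shared vocabulary (missing words count as 0).
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score near 1.0, texts with no words in common score 0.0; much of the course is about replacing these count vectors with better representations.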

This is an applied course; each class period will be divided between a short lecture and in-class lab work using Jupyter notebooks (roughly 50% each). Students will be programming extensively during class, and will work in groups with other students and the instructors. Students must prepare for each class and submit preparatory materials before class. Attendance is required.

This course is targeted to graduate students across a range of disciplines (including information, English, sociology, public policy, journalism, computer science, law, etc.) who are interested in text as data and can program in Python but may not have formal technical backgrounds.

Office hours

David Bamman:
  • Wednesday, 10am-noon (312 South Hall)

Naitian Zhou:
  • Monday, 1-2pm (205 South Hall)

Prerequisites

Graduate student status; proficient programming in Python (programs of at least 200 lines of code), equivalent to INFO 206A/B.

Colab

Most assignments will be in the form of Jupyter notebooks that can be run locally on your computer or on Google Colab. Some assignments (especially those involving LLMs) will require access to a GPU; you can expect to need Colab Pro-level ($10) access for about a month.
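Before running a GPU-dependent notebook cell, it can save time to check whether the runtime actually has a GPU attached. A minimal sketch (assuming PyTorch as the deep learning library, as used in this course; it degrades gracefully if torch isn't installed):

```python
import importlib.util

def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # Check for torch without crashing on environments where it isn't installed.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

print("GPU available:", gpu_available())
```

In Colab, a GPU is attached via Runtime → Change runtime type; if this prints False there, the notebook will fall back to (much slower) CPU execution.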

Syllabus

(Subject to change.)

| Week | Date | Topic | Core Readings | Supplementary |
|---|---|---|---|---|
| 1 | 8/28 | Introduction | Nguyen et al. 2020 | Ziems et al. 2023 |
| 2 | 9/2 | Words | SLP ch. 2; Potts | Mielke et al. 2021; Manshel 2020; Fischer-Baum et al. 2020 |
| | 9/4 | Finding distinctive terms | Kilgarriff 2001 (up to p. 248); Monroe et al. 2009 (up to 3.3) | Jurafsky et al. 2014; Mosteller and Wallace 1964 |
| 3 | 9/9 | Lexical semantics/word embeddings | SLP3 ch. 5 | Shechtman 2021; Soni et al. 2021 |
| | 9/11 | Bias in word embeddings | An et al. 2018 | Kozlowski et al. 2019 |
| 4 | 9/16 | EDA: Topic models | Blei 2012 | Klein 2020; Antoniak et al. 2019; Demszky et al. 2019; Grimmer 2010 |
| | 9/18 | Annotating data | Krippendorff 2018, "Reliability" (bCourses) | Vidgen et al. 2021; Voigt et al. 2017 |
| 5 | 9/23 | Text classification: logistic regression | SLP ch. 4; Scikit-learn tutorial | Zhang et al. 2018; Broadwell et al. 2017 |
| | 9/25 | Hypothesis testing 1 | NLTK 6; Dror et al. 2018 | Field et al. 2021; Blodgett et al. 2020; Denny et al. 2018 |
| 6 | 9/30 | Hypothesis testing 2: bootstrap; permutation tests | A Gentle Introduction to the Bootstrap Method | Antoniak and Mimno 2017 |
| | 10/2 | Language models: basics | SLP ch. 3 | Danescu-Niculescu-Mizil et al. 2013 |
| 7 | 10/7 | Transformers | SLP ch. 8 | Gururangan et al. 2022; Chang et al. 2023 |
| | 10/9 | Masked language models | SLP ch. 10 | |
| 8 | 10/14 | EDA: Text clustering | Blog post; Scikit-learn clustering | Nelson 2020; Wilkens 2016; Viswanathan et al. 2023 |
| | 10/16 | Named entity recognition | SLP3 ch. 17.3 | Erlin et al. 2021; Evans and Wilkens 2018 |
| 9 | 10/21 | Large language models 1 | SLP ch. 7 | Liu et al. 2021 |
| | 10/23 | Large language models 2 | SLP ch. 7; Prompt Engineering Guide | Jurgens et al. 2023 |
| 10 | 10/28 | Large language models 3 | SLP ch. 11 | |
| | 10/30 | Large language models 4 | Wang et al. 2024; Park et al. 2023 | |
| 11 | 11/4 | WordNet | SLP3 I; NLTK 2 | Tenen 2018 |
| | 11/6 | Dependency parsing | SLP3 ch. 19 | Reeve 2017; Underwood et al. 2018 |
| 12 | 11/11 | Holiday -- no class | | |
| | 11/13 | Coreference resolution | SLP ch. 23 | Sims and Bamman 2020 |
| 13 | 11/18 | Information extraction | SLP ch. 20 | Keith et al. 2017 |
| | 11/20 | Sequence alignment | Needleman–Wunsch; Smith–Waterman | So et al. 2019; Wilkerson et al. 2014 |
| 14 | 11/25 | Multimodal models | | |
| | 11/27 | Holiday -- no class | | |
| 15 | 12/2 | Final project poster session | | |
| | 12/4 | Final project poster session | | |

Grading

10% Participation
15% Reading responses
25% Programming homeworks
50% Project:
      1% Greenlit project idea
      4% Proposal/literature review
     15% Midterm report
     25% Final report
      5% Presentation

We will typically have a short programming homework or reading response due before each class. Each homework will be graded as {check+, check, check-, 0}. We will drop your 2 lowest-scoring programming homeworks. You'll also have regular reading responses, in which you read a paper in NLP and write a few paragraphs describing what's original or interesting about it. The purpose of this reading is not only to become familiar with the affordances of NLP -- how others have used it to discover new knowledge -- but also to internalize what academic writing looks like, as you create that kind of work yourself. We will drop your lowest reading response when calculating your final grade. No late programming homeworks or reading responses will be accepted.

The final project deliverables you create should be polished and reflect an understanding of what makes for a good NLP paper -- not simply stylistically, but in the organization of information (including contextualizing a contribution within a landscape of related work, motivating experimental design choices, and substantiating claims with evidence). You'll need to cultivate these skills through lots of reading, so don't just outsource it to ChatGPT! The final project proposal, midterm report, and final report may be submitted up to 24 hours late without penalty; after that 24-hour grace period, late submissions for these components will be penalized 10% of the total possible points for that assignment per day, up to 5 days late (e.g., 1.5 points per day for a late midterm report, which is worth 15 points in total); submissions more than 5 days late will receive 0 credit. The final project presentation will not be accepted late.

Attendance is required in this class and is reflected in your participation grade; each class will usually have some activity to work on and submit in class that illustrates the concepts we are covering. I know things come up that prevent attending every class; to set expectations: missing up to 3 classes will not adversely affect your grade, while missing more than 8 classes (roughly 1/3 of class sessions) will lead to a participation grade of 0. If you miss all of the first 3 classes, we'll move to drop you from the class to make space for others.

Project

Info 256 will culminate in a semester-long project (involving one to three students) that uses natural language processing in support of an empirical research question. The project will comprise five components:

• Greenlit project idea (1 point). Students will suggest a research topic and receive a green light to develop it into a proposal.
• Project proposal and literature review (4 points). Students will propose the research question to be examined, motivate why it is an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community.
• Midterm report (15 points). By the middle of the course, students should present initial experimental results and establish a validation strategy to be carried out by the end of experimentation.
• Final report (25 points). The final report will include a complete description of the work undertaken for the project -- including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis -- along with code and data to reproduce the results in the paper. Projects will be evaluated according to standards for conference publication, including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets).
• Presentation (5 points). At the end of the semester, teams will present their work to the class and broader Berkeley community in an in-class poster session on 12/2 and 12/4.

All reports should use the ACL style files on Overleaf.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source (for additional information on plagiarism, see here). This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. All homeworks and project deliverables are due at the time and date of the deadline. We have a zero-tolerance policy for cheating and plagiarism; violations will be referred to the Center for Student Conduct and will likely result in failing the class.

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.

AI Assistants

This is a class on NLP, and LLMs are NLP technologies; one goal of this course is to understand how to use LLMs sensibly while still prioritizing learning outcomes. So what are the learning outcomes of this class?

• Above all, you are cultivating broad knowledge of the landscape of methods available in NLP for answering questions involving text as data, and developing an understanding of which methods are appropriate for a given task. When given a new task that no one has seen before, you should know how to think about it and be able to decide which methods are suitable and which are not. If you ask an LLM "what should I do?", you need to be able to assess the quality of the response. You are cultivating a sense of taste that comes from experience, so you need to give yourself that experience thinking through problems.
• You are getting experience implementing those ideas in code. AI assistants (e.g., Copilot) can dramatically help with this and provide a natural sandbox to implement ideas and see if they work. Teaching you Python is not a learning outcome of this class, so I don't care if you're using Copilot to tell you how to read in a CSV file or sort a Python dict by value. But you should understand what the code is doing, be able to answer questions about it, and be empowered to edit it to do something different. You are always ultimately responsible for the code you write.
• You will read a range of papers in NLP to see how creative people have been in applying it to measurement problems. The goal here (as noted above) is both to give you a sense of the landscape of research that you will be working within (many people have thought about your ideas before, so you should see what they have found) and to show you how research is structured -- e.g., how research questions, models, and validation all fit together. You are again cultivating your sense of research taste as you read these papers, and this is only developed through lots of experience and reading.

I highly recommend that you don't try to game homeworks and reading responses with LLMs. They're not there as busywork, but to give you repeated experience cultivating the tastes mentioned above. Work through them. Feel free to use automatic coding assistants for some low-level aspects of this work (as you'd certainly do at your job or research in the future), but be sure to work through the core of the challenge yourself in order to get that experience solving these kinds of problems. We'll see how well you've been able to cultivate that taste directly in the final projects you submit (which bear most of the grading weight); but also think beyond this class -- as you talk with people who work in NLP (at conferences or in job interviews), will your experience come through in those moments without ChatGPT to help you?

To make sure we have a clean dividing line between fair and unfair use of automatic writing/coding assistants, you retain ultimate responsibility for the behavior of any automatic assistant you use. If an LLM plagiarizes text that someone else wrote (and you then submit that text as your own), you're the one who's ultimately accountable for plagiarizing. Even if assisted by code suggestions, any work you submit should still be the product of your own creation, representing your own ideas, code, and words; submissions that rely too heavily on AI tools will not be graded favorably. You are free to use LLMs to rephrase text that you authored, or to help brainstorm ideas, but you must be the author of any text you submit (e.g., reading responses or project reports); you cannot simply edit text that originated in an LLM. You must know what all code you submit does; if we ask and you don't know, it will affect your grade.

For the use of any AI assistance in your final project deliverables, you should follow the ACL Policy on AI Writing Assistance.