Big Data and Language Technologies Seminar

General Information

Lecturer Jun.-Prof. Dr. Martin Potthast
Lab Advisors Lukas Gienapp, Niklas Deckers
Workload 2 SWS Seminar Lecture, 2 SWS Lab
Seminar Lecture Monday, 13:30 - 15:00, starting 04.04.2022, in person, SG 3-12 HSG 6
Lab Monday, 15:30 - 16:45, starting 04.04.2022, in person, SG 3-12 HSG 6
Contact Email, or via Discord server "webis-lectures"
Description Information on the web is growing at an exponential pace, driven by social media platforms, blogs, and news. Such large-scale data sources call for high-end, scalable, distributed architectures for cognitive analysis, which shape the business decisions of many industries. In addition, deep learning has been propelled into the mainstream and is now accessible to researchers and companies alike, thanks to tools such as TensorFlow and PyTorch. The Webis research group operates large-scale high-performance compute infrastructure (totaling more than 3000 CPU cores, 10+ petabytes of storage, and 24 high-end GPUs), which will be put to use in the course of this seminar. Students will receive application-oriented training in big data and deep learning frameworks and language technologies, and will explore interesting research questions. This seminar requires good skills in both programming (Python) and algorithms.
Requirements

Please note that this course requires prior Python programming experience. In addition, some familiarity with Linux environments, and knowledge of machine learning basics, is highly recommended. To help you gauge your prior subject knowledge, we've provided a set of self-assessment questions below. Read through the self-assessment questions, and take note of how many you can answer in the affirmative, and how many answers you know without having to look them up.

Self-assessment questionnaire

This questionnaire is not meant as study material for catching up; however, the questions cover a broad range of topics within the scope of the course and should highlight potential weak points.

Python

  • Have you worked with Python 3 before?
  • Do you know how to run Python scripts?
  • Have you installed pip packages before?
  • Have you used Jupyter Notebooks before?
  • Do you know how to assign variables?
  • How to use for and while loops?
  • How to define functions with default arguments?
  • How to use *args, **kwargs and variable unpacking?
  • When does call by object reference happen and why might it lead to unexpected problems?
  • How is a class defined and instantiated?
  • How does inheritance work? Why use super()?
  • How is a generator defined and iterated?
  • What is a list comprehension?
  • How can the entries of a list be doubled using a list comprehension?
  • How to get only even entries of a list using a list comprehension?
  • How to get the Collatz successor of each number in a list using a list comprehension and a conditional expression?
  • How is a dictionary defined? How can you get a value?
  • In what ways can a dict be iterated?
  • What are dictionary comprehensions?
  • What are lambda functions? Why are they used?
  • What is map()?
  • How to use the key argument in sorted()?
  • Have you worked with numpy before?
  • What are ndarrays? Why use them?
  • What are shape and dtype? How can both be altered?
  • What is the difference between reshape and transpose? When can you safely use reshape?
  • How to matrix multiply two ndarrays?
  • What is the axis argument in ndarray.sum()?
  • How to read all lines from a file?
  • What is JSON? How to read/write JSON data in Python?
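
As a rough orientation for the comprehension-related questions above, the following minimal sketch shows one possible answer to each. The sample data and variable names are assumptions chosen for illustration, and "even entries" is read here as entries with an even value, not entries at even positions.

```python
# Illustrative data; these values and names are assumptions, not course material.
numbers = [1, 2, 3, 4, 5, 6]

# Double every entry of the list.
doubled = [n * 2 for n in numbers]

# Keep only the entries with an even value.
evens = [n for n in numbers if n % 2 == 0]

# Collatz successor via a conditional expression:
# n / 2 if n is even, otherwise 3n + 1.
collatz = [n // 2 if n % 2 == 0 else 3 * n + 1 for n in numbers]

# Dictionary comprehension mapping each number to its square.
squares = {n: n * n for n in numbers}

# sorted() with a key function: order words by their length.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)

print(doubled)
print(evens)
print(collatz)
print(squares)
print(by_length)
```

If you can predict all five printed values before running the script, the corresponding questions are likely not a weak point for you.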

Linux Command line/Remote work

  • How to compose shell commands with pipes and input/output redirection?
  • How to pass each line from an input file to the same command as an argument, and run all resulting processes in parallel?
  • How to find out how many lines in a text file contain the strings "cat" or "hat"?
  • How to download and then unpack a zip file from the command line?
  • How to log into a remote machine via SSH?
  • What is an SSH public key and how to create one?
  • How to make sure a program continues to run after you log out?
  • How to transfer a file to a remote machine using only an SSH connection?

Machine Learning Basics

  • What is the difference between supervised and unsupervised learning?
  • What is the difference between regression and classification?
  • What is gradient descent, and how does it work?
  • What are precision, recall and accuracy, and how are they computed?
  • What is overfitting, why is it a problem, and how to detect and avoid it?
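
For the question on precision, recall, and accuracy, the following minimal sketch computes all three for a binary classification task from the counts of true/false positives and negatives. The function name `classification_metrics`, the `positive` parameter, and the sample labels are hypothetical and only serve as an illustration; returning 0.0 for an undefined ratio is a convention assumed here, not the only possible choice.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for binary labels.

    Hypothetical helper for illustration; ratios with a zero
    denominator are reported as 0.0 by assumption.
    """
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Made-up labels for demonstration purposes only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
print(precision, recall, accuracy)
```

In this made-up example there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so all three measures come out to 2/3.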

Deliverables

In order to successfully complete this course, you will have to:
  • Actively participate in seminar sessions
  • Complete a half-semester long group research project (topics to be assigned a few weeks into the seminar)
  • After selecting your project topic, submit a short exposé describing your goals and work plan (1-2 pages)
  • Give a group presentation discussing your progress (5 minutes + discussion)
  • At the end of the semester, submit a research report discussing your approach and results (4 pages double column + references)

Announcements

  • Due to unforeseen circumstances, we once again have to cancel next Monday's (23.05.) lecture on Big Data and Language Technologies. Its contents will instead be covered the week after. The lab session (Topic: Intro to Prompt Engineering) will take place as usual from 15:30 on.

Organization

  • General Note This is a joint seminar between Uni Weimar and Uni Leipzig. We will use our camera equipment to link both seminars.
  • Leipzig-specific Remark This course will have to be credited as Citizen Science on your transcripts as it represents a topic-wise variation of that course. This also means that it cannot be credited twice if you already completed Citizen Science.
  • Communication
    • Lecture website - materials and announcements will be uploaded on this website.
    • Discord - there is a Discord server for this lecture to ask questions and engage in discussion. Check your email for an access code. Please join the server and choose a nickname that lets us identify you (at least your surname).
    • Email - important announcements will be sent out via email.

Lecture Notes

  • Big Data and Language Technologies » Introduction » Organization, Literature [slides] [video (LE)] [video (WE)]
  • Big Data and Language Technologies » Introduction » Introduction [slides] [video (LE)] [video (WE)]
  • Big Data and Language Technologies » Machine Learning Basics » Regression [slides] [video]
  • Big Data and Language Technologies » Machine Learning Basics » Gradient Descent [slides] [video]
  • Big Data and Language Technologies » Machine Learning Basics » Recurrent Neural Networks [slides] [video]

Lab Sessions

Date Title Description Materials Deliverables Stream
04.04.2022 Deep Learning in Python (Session 1)
  • Introduction & Lab Setup
  • Keras Basics
11.04.2022 Deep Learning in Python (Session 2)
  • TensorFlow Datasets
  • Custom loss functions
  • Custom training loops
  • TensorBoard
18.04.2022 No Session (Easter Monday)
25.04.2022 Deep Learning in Python (Session 3)
  • Hugging Face & pretrained models
Materials: [lab]
02.05.2022 Deep Learning on SLURM (Session 1)
  • 20NG Dataset
  • Local model development
  • SLURM basics
09.05.2022 Deep Learning on SLURM (Session 2)
  • SLURM deployment
  • Parameter sweeping
Deliverable: Set up Cluster Access
16.05.2022 Project Fair
  • Introduction of available project topics
23.05.2022 Prompt Engineering (Session 1)
  • Prompt Engineering
  • OpenAI GPT-3 playground
  • Google BIG-Bench
30.05.2022 Prompt Engineering (Session 2)
  • Self-Deployed GPT-2 on SLURM
  • OpenAI GPT-3 API
06.06.2022 No Session (Whit Monday)
13.06.2022 Prompt Engineering (Session 3)
  • Prompt Engineering Projects
Deliverable: Prompt Engineering Presentations
20.06.2022 Q&A Session
Deliverable: Project Exposé
27.06.2022 Group Meetings
  • Individual meetings with groups
04.07.2022 Mid-Term Presentations
  • Group project presentations
Deliverable: Project Presentation
11.07.2022 Q&A Session
29.08.2022 Project Deadline
  • Hand in your report in PDF format by email. Cutoff is 22:00 CEST.
Deliverable: Project Report