Welcome to CS 5963 / Math 3900! This course is an introduction to data science. The major goals of this course are to learn how to use tools for acquiring, cleaning, analyzing, exploring, and visualizing data; making data-driven inferences and decisions; and effectively communicating results. These will be accomplished through an in-depth sequence of topics which will introduce students to the following data preparation and analysis methods:
- Acquiring data through web-scraping and data APIs
- Cleaning and reshaping messy datasets using methods such as regular expressions or dedicated tools such as OpenRefine
- Exploratory data analysis and visualization
- Rating and ranking
- Clustering and classification
- Network analysis
- Regression and statistical inference
- Natural language processing
A major component of this course will be learning how to use Python-based programming tools to apply these methods to real-life datasets.
Students are expected to have some programming experience and to have taken at least Calculus I (UU Math 1170, 1210, 1250, 1310, 1311 or equivalent). If in doubt, ask one of the instructors. You should also own a notebook computer that you can bring to class.
There is no required textbook for the class. However, students may find it useful to consult the following textbooks for reference.
Data Science from Scratch: First Principles with Python
O’Reilly Media (2015)
Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.
Read for free on Campus
Doing Data Science: Straight Talk from the Frontline
Cathy O’Neil, Rachel Schutt
O’Reilly Media (2013)
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
Blog that was the basis for this book
Learning the Pandas Library: Python Tools for Data Munging, Analysis, and Visualization
CreateSpace Independent Publishing Platform (2016)
Python is one of the top 3 tools that Data Scientists use. One of the tools in their arsenal is the Pandas library. This tool is popular because it gives you so much functionality out of the box. In addition, you can use all the power of Python to make the hard stuff easy!
Learning the Pandas Library is designed to bring developers and aspiring data scientists who are anxious to learn Pandas up to speed quickly. It starts with the fundamentals of the data structures. Then, it covers the essential functionality. It includes many examples, graphics, code samples, and plots from real world examples.
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman
2nd ed., Cambridge University Press (2014)
The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).
The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.
Download from book website
Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber, and Jian Pei
3rd ed., Morgan Kaufmann (2011)
Data Mining: Concepts and Techniques provides the concepts and techniques for processing gathered data or information, which will be used in various applications. Specifically, it explains data mining and the tools used in discovering knowledge from the collected data. This process is referred to as knowledge discovery from data (KDD). The book focuses on the feasibility, usefulness, effectiveness, and scalability of techniques for large data sets.
An Introduction to Statistical Learning
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
Springer Texts in Statistics (2015)
This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students, and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations of how to implement the various methods in real-life settings, and should be a valuable resource for a practicing data scientist.
Read for free
The class meets twice a week for lectures and joint class activities. The class activities are designed to help you master the relevant materials, to work on your homework in groups, and to get you started on your project. The weekly schedule of lectures is posted on the course web site.
Labs are an important aspect of the course, as they are meant to give you the technical skills to successfully complete the homework assignments and projects, complementing the theoretical knowledge taught in lectures. Sections will typically consist of a short presentation and live coding, followed by time to complete some exercises with the help of the instructors. Lab topics and times are announced in the schedule. You are expected to bring your own computer with the necessary software installed to all labs. Labs can be downloaded here.
The course schedule includes required weekly readings – you are free to study ahead, but the schedule ensures that you are prepared for the activities in class and the homework. The goal of the reading assignments is to familiarize yourself with new terminology and definitions, to learn new design and programming skills, and to determine which part of the subject needs more attention. The homework assignments will contain questions about the mandatory readings. When answering these, please be brief and to the point!
Neither mathematics nor computer science is a spectator sport; mastery of either subject requires a significant amount of practice! Homework assignments provide an opportunity to practice your programming skills, to think about analytical concepts in a new way, and to test your understanding of the material. The homework assignments are also designed to prepare you for your course project. You should view the homework as an opportunity to learn, and not to “earn points”. The homework will be graded holistically to reflect this objective. Homework will be submitted through Canvas.
At the core of this course is a course project. The goal of the project is to analyze a topic of your choosing and document your findings. You will acquire and clean the data, use tools from the class to explore, describe, and analyze the data, and evaluate the results to make predictions. The path to a good project will involve mistakes and wrong turns. Mistakes are valuable in finding the path to a solution, but they cost a significant amount of time. It is therefore imperative that you begin your project early! The project has an intermediate milestone that will allow you to get feedback and to iterate.
In your project you will work closely with classmates in 2-3 person project teams. You can find more information on the project page.
During lectures, no internet-enabled devices (notebooks, smartphones, tablets, etc.) are permitted unless they are necessary for a class activity. While this may sound strict and weird for a CS class, there are good reasons for banning devices in the classroom: messengers and notifications are designed to grab your attention and are de facto irresistible. Also, taking notes by hand has been shown to be more effective for learning than taking notes on a computer (news story).
Your final grade will be determined by the number of points you collect. You can collect various amounts of points for the different parts of the class:
- Homework: 60%, assessed on your individual submission. The homework assignments are weighted based on difficulty and length (a two-week homework counts more than a one-week homework).
- Project: 40%, assessed on meeting the project criteria and your peer assessment. The 40% is split between the two milestones and the proposal. 5% are assigned to the proposal, 10% are assigned to your first milestone, 25% to your final submission.
- We will evaluate your work holistically beyond mechanical correctness and focus on the overall quality of the work using the following scale:
100 = Excellent / no mistakes (or really minor)
80 = Good / some mistakes
50 = Fair / some major conceptual errors
20 = Poor / did not finish
0 = Did not participate / did not hand in
A weighted overall average of 10 constitutes a perfect grade and is equivalent to an A. Here are tentative grade ranges for your reference. Note that these might be adjusted.
| Grade | Range |
|---|---|
| A- | > 90 and ≤ 95 |
| B+ | > 85 and ≤ 90 |
| B | > 80 and ≤ 85 |
| B- | > 75 and ≤ 80 |
| C+ | > 70 and ≤ 75 |
| C | > 65 and ≤ 70 |
| C- | > 60 and ≤ 65 |
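As a rough illustration of how the percentages above combine, the following sketch computes a weighted average from per-component scores on the 0 to 100 quality scale. The function name, the dictionary keys, and the example scores are made up for illustration. The sketch also assumes a single aggregate homework score, and it ignores the peer-assessment adjustment and any bonus points; it is not the staff's actual grading procedure.

```python
# Weights taken from the grading breakdown: homework 60%, project 40%
# (5% proposal, 10% first milestone, 25% final submission).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "homework": 0.60,
    "proposal": 0.05,
    "milestone": 0.10,
    "final_submission": 0.25,
}


def weighted_average(scores):
    """Combine per-component scores (0-100 scale) into one weighted average."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up example scores, one per component.
example_scores = {
    "homework": 80,
    "proposal": 100,
    "milestone": 80,
    "final_submission": 100,
}
print(round(weighted_average(example_scores), 2))
```

With these example scores the contributions are 48, 5, 8, and 25, which sum to 86.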
In addition, the staff will select the top projects through a review and voting process. These projects will be presented during our last lecture, featured on the website, and get up to 12 points for their exceptional work. A small number of projects will win the coveted Best Project prize including Swiss chocolate.
Project Group Peer Assessment
In the professional world, three important features affect your productivity and success: your own effort, the effort of people you depend on, and the way you work together. For this reason we have chosen a team-based approach that values all three of those features. After each team-based project you will provide an assessment of the contributions of the members of your team, including yourself. Your scores on the projects are adjusted up or down depending on the following factors:
- Your teammates’ view of your contributions to the team
- The accuracy of your own assessment of your contributions
- The accuracy of your assessment of each of your teammates’ contributions
Your teammates’ assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall course evaluation.
You are welcome (and encouraged!) to discuss the course’s ideas, material, and homework with others in order to better understand them, but the work you turn in must be your own (or for the project, yours and your teammates’). For example, you must write your own code and critically evaluate the results in your own words. You may not submit the same or similar work to this course that you have submitted or will submit to another. Nor may you provide or make available solutions to homework assignments to individuals who take or may take this course in the future.
You must acknowledge any source code that was not written by you by mentioning the original author(s) directly in your source code (comment or header) and include a link to the source. However, you are encouraged to use libraries, unless explicitly stated otherwise!
You may use examples you find on the web as a starting point, provided their license allows you to reuse them. You must cite the source using proper citations (author, year, title, time accessed, URL) both in the source code and in any publicly visible material. You may not use existing complex combinations or large examples.
Missed Activities and Assignment Deadlines
Projects and homework must be turned in on time, with the exception of late days for homework as stated below. Because of the emphasis on teamwork, it is important that everybody attends and proactively participates in class. Due to the collaborative nature of the activities, it is not possible to make up any missed team activities, such as project work. We understand, however, that certain factors may occasionally interfere with your ability to participate or to hand in work on time. If that factor is an extenuating circumstance, we will ask you to provide documentation directly issued by the University, and we will try to work out an agreeable solution with you (and your team).
You can turn in your assignment up to two days late; however, for each day that an assignment is turned in late, we will deduct 10% from the total possible points. That is, one day late is 10% off and two days late is 20% off. So if your assignment is two days late, the maximum number of points (out of 10) that you can receive is 8. By permission of the instructor in extenuating circumstances, you may use more than two late days; however, the 10% per day rule will still apply. If you have a verifiable medical condition or other special circumstances that interfere with your coursework, please let us know as soon as possible.
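The late-day arithmetic above can be sketched in a few lines of Python. The function name and parameters are hypothetical, and clamping the result at zero for very late submissions is an assumption of this sketch, not a stated course rule.

```python
def max_points_when_late(total_points, days_late):
    """Maximum attainable points after deducting 10% of the total per late day."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # Work in whole percentages to avoid floating-point surprises.
    remaining_percent = max(0, 100 - 10 * days_late)  # clamp at 0: an assumption
    return total_points * remaining_percent / 100


print(max_points_when_late(10, 1))  # one day late
print(max_points_when_late(10, 2))  # two days late
```

For an assignment worth 10 points, this gives 9 for one day late and 8 for two days late, matching the example in the policy.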
It is very important to us that all assignments are properly graded. If you believe there is an error in your assignment grading, please submit an explanation via email to us (the staff mailing list) within 7 days of receiving the grade. No regrade requests will be accepted orally, and no regrade requests will be accepted more than 7 days after you receive the grade for the assignment.
If you have a documented disability (physical or cognitive) that may impair your ability to complete assignments or otherwise participate in the course and satisfy course criteria, please meet with us at your earliest convenience to identify, discuss, and document any feasible instructional modifications or accommodations.
This class occasionally uses material developed for Harvard’s CS 109, taught by Hanspeter Pfister, Joe Blitzstein, Rahul Dave, and Verena Kaynig. We have drawn on materials and examples found online and tried our best to give credit by linking to the original source. You can find these credits mainly as direct links to the sources from the slides (e.g., hyperlinked from images). Please contact us if you find materials where the credit is missing or that you would rather have removed.
User Notice for Copyrighted Materials on Course Websites
This course website and all original content provided as part of this course are licensed under the Creative Commons CC BY license. Other content, such as text, images, graphics, and audio and video clips (collectively, the “Content”), is protected by copyright law. In some cases, the copyright is owned by third parties, and we are making the third-party content available to you under the fair use doctrine. Fair use permits only certain limited uses of the Content.
You may use this Content only for your personal, noncommercial educational and scholarly use.