Welcome to Introduction to Data Science (COMP 5360 / MATH 4100)! The major goals of this course are to learn how to use tools for acquiring, cleaning, analyzing, exploring, and visualizing data; to make data-driven inferences and decisions; and to communicate results effectively. These goals will be accomplished through course activities on the following data science topics:
- Introduction to data analysis tools in Python
- Descriptive statistics
- Data structures with Pandas
- Introductory hypothesis testing and statistical inference
- Web scraping and data acquisition via APIs
- Linear regression
- Classification methods, including logistic regression, k-nearest neighbors, decision trees, and support vector machines
- Data visualization
- Clustering methods
- Dimensionality reduction, including principal component analysis
- Network analysis
- Rating, ranking, and elections
- Cleaning and reformatting messy datasets using regular expressions or dedicated tools such as OpenRefine
- Natural language processing
- Ethics of big data
A major component of this course will be learning how to use Python-based programming tools to apply these methods to real-life datasets.
At the end of the course, a student should be able to:
- Acquire data through web-scraping and data APIs
- Clean and reshape messy datasets
- Use exploratory tools such as clustering and visualization tools to analyze data
- Perform linear regression analysis
- Use methods such as logistic regression, nearest neighbors, decision trees, and support vector machines to build a classifier
- Apply dimensionality reduction tools such as principal component analysis
- Perform basic analysis of network data
- Evaluate outcomes and make decisions based on data
- Effectively communicate results
Prerequisites: completion of at least one of the following:
- MATH 1170 - Calculus for Biologists I (4)
- MATH 1210 - Calculus I (4)
- MATH 1250 - Calculus for AP Students I (4)
- MATH 1310 - Engineering Calculus I (4)
- MATH 1311 - Accelerated Engineering Calculus I (4)
Recommended Prerequisites/Corequisites: Some programming experience with Python or a similar language, as demonstrated by the ability to write short programs incorporating variables, lists and strings, loop structures, and data file input and output. More advanced mathematics, such as linear algebra or introductory statistics, is also recommended.
If in doubt, ask one of the instructors. You should also own a notebook computer that you can bring to class.
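As a rough self-check of the recommended programming level, the sketch below combines variables, lists and strings, a loop, and file input and output in one short program. It is only an illustration, not an official placement test; the file name, the column names, and the temperature values are made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def average_by_city(csv_path):
    """Read (city, temperature) rows and return the mean temperature per city."""
    totals = {}  # city -> [running sum, number of readings]
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for city, temp in reader:
            entry = totals.setdefault(city, [0.0, 0])
            entry[0] += float(temp)
            entry[1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


def demo():
    """Write a small made-up data file, then read it back and summarize it."""
    rows = ["city,temp_c", "Salt Lake City,10", "Salt Lake City,14", "Moab,20"]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "temps.csv"  # hypothetical file name
        path.write_text("\n".join(rows) + "\n")
        return average_by_city(path)


if __name__ == "__main__":
    print(demo())
```

If writing and understanding a program of this size feels comfortable, your programming background is probably sufficient.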
There is no required textbook for the class. However, students may find it useful to consult the following textbooks for reference.
Data Science from Scratch: First Principles with Python
O’Reilly Media (2015)
Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.
Doing Data Science: Straight Talk from the Frontline
Cathy O’Neil, Rachel Schutt
O’Reilly Media (2013)
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
Blog that was the basis for this book
Learning the Pandas Library: Python Tools for Data Munging, Analysis, and Visualization
CreateSpace Independent Publishing Platform (2016)
Python is one of the top 3 tools that Data Scientists use. One of the tools in their arsenal is the Pandas library. This tool is popular because it gives you so much functionality out of the box. In addition, you can use all the power of Python to make the hard stuff easy! Learning the Pandas Library is designed to bring developers and aspiring data scientists who are anxious to learn Pandas up to speed quickly. It starts with the fundamentals of the data structures. Then, it covers the essential functionality. It includes many examples, graphics, code samples, and plots from real world examples.
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman
2nd ed., Cambridge University Press (2014)
The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).
The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.
Download from book website
Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber, and Jian Pei
3rd ed., Morgan Kaufmann (2011)
Data Mining: Concepts and Techniques provides the concepts and techniques in processing gathered data or information, which will be used in various applications. Specifically, it explains data mining and the tools used in discovering knowledge from the collected data. This book is referred as the knowledge discovery from data (KDD). It focuses on the feasibility, usefulness, effectiveness, and scalability of techniques of large data sets.
An Introduction to Statistical Learning
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
Springer Texts in Statistics (2015)
This book provides an introduction to statistical learning methods. It is aimed for upper level undergraduate students, masters students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist.
Read for free
The class meets twice a week for lectures and joint class activities. The weekly schedule is posted on the course website.
Lectures contain both theoretical knowledge and technical components to give you the skills to successfully complete the homework assignments and projects. Lectures will often consist of a short presentation and live coding, followed by time to complete some exercises with the help of the instructors. Lecture topics and times are announced on the schedule. You are expected to bring your own computer with the necessary software installed to all labs. Labs can be downloaded here.
Class activities are designed to help you master the relevant materials, to work on your homework in groups, and to help you start your project.
The course schedule includes required weekly readings – you are free to study ahead, but the schedule ensures that you are prepared for the activities in class and the homework. The goal of the reading assignments is to familiarize yourself with new terminology and definitions, to learn new design and programming skills, and to determine which part of the subject needs more attention. The homework assignments will contain questions about the mandatory readings. When answering these, please be brief and to the point!
Neither mathematics nor computer science is a spectator sport; mastery of either subject requires a significant amount of practice! Homework assignments provide an opportunity to practice your programming skills, to think about analytical concepts in a new way, and to test your understanding of the material. The homework assignments are also designed to prepare you for your course project. You should view the homework as an opportunity to learn, and not to “earn points”. The homework will be graded holistically to reflect this objective. Homework will be submitted through Canvas.
Homework assignments are due on Fridays, 11:59 pm ET, unless stated otherwise. For due dates see the schedule.
The assignments are published on GitHub.
Homework submissions will be handled through Canvas. Submit a zip file that includes all files needed to execute the homework.
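One way to build the zip file is with Python's standard library, as in the minimal sketch below. The folder layout (`hw1`, `hw1.ipynb`, `data/input.csv`) and the archive name are hypothetical; follow the naming requested in each assignment. Listing the archive contents before uploading is a quick check that nothing needed to run the homework is missing.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_homework(homework_dir, archive_stem):
    """Zip everything under homework_dir into <archive_stem>.zip and return its path.

    Keep archive_stem outside homework_dir so the archive does not include itself.
    """
    return shutil.make_archive(str(archive_stem), "zip", root_dir=str(homework_dir))


def list_archive(archive_path):
    """Return the sorted file names stored in the archive (directory entries skipped)."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(name for name in zf.namelist() if not name.endswith("/"))


def demo():
    """Build a made-up homework folder, zip it, and list what went in."""
    with tempfile.TemporaryDirectory() as tmp:
        hw = Path(tmp) / "hw1"  # hypothetical homework folder
        (hw / "data").mkdir(parents=True)
        (hw / "hw1.ipynb").write_text("{}")
        (hw / "data" / "input.csv").write_text("x,y\n1,2\n")
        archive = zip_homework(hw, Path(tmp) / "hw1_submission")
        return Path(archive).name, list_archive(archive)


if __name__ == "__main__":
    print(demo())
```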
Homework Rules & Hints
A couple of important rules to make our lives easier:
- See the syllabus for our collaboration policy and to learn how to quote your sources.
- We recommend that you use version control with a private repository while you work on your homework. GitHub offers free private repositories for students, and Bitbucket also provides free private repositories. Every time you finish a chunk of work, or when you are done for the day, push your changes to the repository. This avoids data loss, even if your house burns down, because you will always be able to recover what you already pushed. Make sure that your work is NOT PUBLICLY ACCESSIBLE.
- We will grade your work based on the IPython, Jupyter, and library versions introduced in the labs. Make sure that your code is compatible with these versions.
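To compare your environment with the versions used in the labs, you can print what is installed. The sketch below uses only the standard library; the package list is an assumption chosen for illustration, so replace it with the packages and versions actually named in the labs.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report(packages=("ipython", "jupyter", "pandas", "numpy")):
    """Build a short text report; the default package list is only an example."""
    lines = [f"python {sys.version.split()[0]}"]
    for name, version in installed_versions(packages).items():
        lines.append(f"{name} {version or 'not installed'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```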
At the core of this course is a course project. The goal of the project is to analyze a topic of your choosing and present your findings. You will acquire and clean the data; use tools from the class to explore, describe, and analyze the data; and evaluate the results to make predictions. The path to a good project will involve mistakes and wrong turns. It is important to recognize that these missteps are invaluable on the path to a great project, but will require a significant amount of time. It is therefore imperative that you begin your project early! The project has an intermediate milestone that will allow you to get feedback and to iterate. In your project, you will work closely with classmates in 2-3 person teams. You can find more information on the project page.
During lectures, no internet-enabled devices (notebooks, smartphones, tablets, etc.) are permitted unless they are necessary for a class activity. While this may sound strict and weird for a CS class, there are good reasons for banning devices in the classroom: messengers and notifications are designed to grab your attention and are de facto irresistible. Also, taking notes by hand has been shown to be more efficient for learning than taking notes on a computer (news story). Read more on the issue by Clay Shirky (Prof. and writer on social and economic effects of Internet technologies) and Dan Rockmore (Prof. of Computer Science at Dartmouth).
Your final grade will be determined by your performance on the various aspects of the class:
- Homework: 60%, assessed on your individual submission. The homework assignments are weighted based on difficulty and length (a two-week homework counts more than a one-week homework).
- Project: 40%, assessed on meeting the project criteria and your peer assessment. The 40% is split between the proposal and the two milestones: 5% is assigned to the proposal, 10% to your first milestone, and 25% to your final submission.
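The weights above combine as a weighted average. The sketch below is a simplified illustration, assuming each component has already been scored as a percentage; the component names and example scores are made up, and the sketch does not model the per-assignment homework weighting, the peer assessment adjustment, or bonus points.

```python
# Weights in percentage points, taken from the breakdown above.
WEIGHTS = {
    "homework": 60,
    "proposal": 5,
    "first_milestone": 10,
    "final_submission": 25,
}


def final_grade(scores):
    """Combine component scores (each a percentage from 0 to 100) into one percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    example = {  # made-up scores
        "homework": 90,
        "proposal": 100,
        "first_milestone": 80,
        "final_submission": 85,
    }
    print(final_grade(example))  # (5400 + 500 + 800 + 2125) / 100 = 88.25
```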
We will evaluate your work holistically beyond mechanical correctness and focus on the overall quality of the work.
In addition, the instructors will select the top projects through a review and voting process. These projects will be featured on the website and the project teammates will receive bonus points. The teammates will also be awarded Swiss chocolate!
Project Group Peer Assessment
In the professional world, three important features affect your productivity and success: your own effort, the effort of people you depend on, and the way you work together. For this reason we have chosen a team-based approach that values all three of those features. After each team-based project you will provide an assessment of the contributions of the members of your team, including yourself. Your scores on the projects are adjusted up or down depending on the following factors:
- Your teammates’ view of your contributions to the team
- The accuracy of your own assessment of your contributions
- The accuracy of your assessment of each of your teammates’ contributions
Your teammates’ assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall course evaluation.
Collaboration, Cheating and Plagiarism Policy
You are welcome to discuss the course’s ideas, material, and homework with others in order to better understand them, but the work you turn in must be your own (or for the project, yours and your teammates’). For example, you must write your own code, design your own visualizations, and critically evaluate the results in your own words. You may not submit the same or similar work to this course that you have submitted or will submit to another. Nor may you provide or make available solutions to homeworks to individuals who take or may take this course in the future.
In homeworks you must not use libraries or code provided on the internet except when explicitly permitted in the instructions.
In your project, you may use limited parts of code found online, provided its license allows you to re-use it. You are free to use general purpose frameworks or libraries (e.g., Node.js, Bootstrap, jQuery, etc.). You may not use plotting libraries such as plot.ly. You must acknowledge any source code that was not written by you with a proper citation (author, year, title, time accessed, URL) directly in your source code (comment or header) and provide a link to the source. You can also acknowledge sources in a README.txt file if you used whole classes or libraries. You must also include these references clearly visible on your project website.
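An in-code citation can be as simple as a comment directly above the borrowed code. The sketch below shows one possible layout for the required fields; the text in angle brackets is a placeholder to fill in, and the `chunked` helper is only a stand-in for borrowed code, not something taken from an actual online source.

```python
# Adapted from: <author name> (<year>), "<title of the post or repository>",
# accessed <date and time accessed>, <URL of the source>
# License: <license name> (confirm that it permits re-use)
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    print(chunked([1, 2, 3, 4, 5], 2))
```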
We will use both manual and automatic methods to check your submissions for plagiarism and will also check against online sources and submissions from previous years. For details on the policy, please refer to the School of Computing Cheating Policy. Plagiarism will lead to a failing grade in this course; two such infractions will lead to a ban from all CS programs.
Missed Activities and Assignment Deadlines
All submissions related to projects must be turned in on time. Homeworks are subject to the late day policy stated below. We understand, however, that certain factors may occasionally interfere with your ability to participate or to hand in work on time. If that factor is an extenuating circumstance such as a medical condition, we will ask you to provide documentation directly issued by the University, and we will try to work out an agreeable solution with you (and your team).
You can turn in your homework assignments up to two days late; however, for each day that an assignment is late, we will deduct 10% of the total possible points. That is, one day late is 10% off and two days late is 20% off. So, if your assignment is two days late, the maximum number of points (out of 10) that you can receive is 8. With the instructor’s permission in extenuating circumstances, you may use more than two late days; however, the 10% per day rule will still apply.
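The late-day arithmetic can be written out as a short function. This is a sketch under one reading of the policy, namely that the deduction (10% of the total possible points per late day) is subtracted from the points you earned and that the result cannot go below zero; the function and parameter names are made up for illustration.

```python
def late_adjusted_score(earned, possible, days_late, instructor_permission=False):
    """Apply a deduction of 10% of the possible points for each late day.

    Assumes the deduction is subtracted from the earned points and floors at 0.
    More than two late days requires the instructor's permission.
    """
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > 2 and not instructor_permission:
        raise ValueError("more than two late days requires instructor permission")
    deduction = possible * days_late / 10  # 10% of possible points per day
    return max(0.0, earned - deduction)


if __name__ == "__main__":
    # A perfect 10-point assignment turned in two days late is worth 8 points.
    print(late_adjusted_score(10, 10, 2))
```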
It is very important to us that all assignments are properly graded. If you believe there is an error in your assignment grading, please submit an explanation via email to us (the staff mailing list) within 7 days of receiving the grade. No regrade requests will be accepted orally, and no regrade requests will be accepted more than 7 days after you receive the grade for the assignment.
The Americans with Disabilities Act.
The University of Utah seeks to provide equal access to its programs, services, and activities for people with disabilities. If you will need accommodations in this class, reasonable prior notice needs to be given to the Center for Disability Services, 162 Olpin Union Building, (801) 581-5020. CDS will work with you and the instructor to make arrangements for accommodations. All written information in this course can be made available in an alternative format with prior notification to the Center for Disability Services.
Respect for Diversity
It is our intent that students from all diverse backgrounds and perspectives be well-served by this course, that students’ learning needs be addressed both in and out of class, and that the diversity that the students bring to this class be viewed as a resource, strength and benefit. It is our intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Your suggestions are encouraged and appreciated. Please let us know ways to improve the effectiveness of the course for you personally, or for other students or student groups.
Student Name and Personal Pronoun
Class rosters are provided to the instructor with the student’s legal name as well as “Preferred first name” (if previously entered by you in the Student Profile section of your CIS account). While CIS refers to this as merely a preference, I will honor you by referring to you with the name and pronoun that feels best for you in class, on papers, exams, group projects, etc. Please advise me of any name or pronoun changes (and please update CIS) so I can help create a learning environment in which you, your name, and your pronoun will be respected.
Addressing Sexual Misconduct.
Title IX makes it clear that violence and harassment based on sex and gender (which includes sexual orientation and gender identity/expression) is a civil rights offense subject to the same kinds of accountability and the same kinds of support applied to offenses against other protected categories such as race, national origin, color, religion, age, status as a person with a disability, veteran’s status or genetic information. If you or someone you know has been harassed or assaulted, you are encouraged to report it to the Title IX Coordinator in the Office of Equal Opportunity and Affirmative Action, 135 Park Building, 801-581-8365, or the Office of the Dean of Students, 270 Union Building, 801-581-7066. For support and confidential consultation, contact the Center for Student Wellness, 426 SSB, 801-581-7776. To report to the police, contact the Department of Public Safety, 801-585-2677(COPS).
College of Engineering Guidelines: Academic Calendar, Policies
Please review the College of Engineering guidelines, which you can find here.
This class occasionally uses material developed for Harvard’s CS 109, taught by Hanspeter Pfister, Joe Blitzstein, Rahul Dave, and Verena Kaynig. We have drawn on materials and examples found online and tried our best to give credit by linking to the original source. You can find these credits mainly by direct links to the sources from the slides (e.g., hyperlinked from images). Please contact us if you find materials where the credit is missing or that you would rather have removed.
User Notice for Copyrighted Materials on Course Websites
This course website and all original content provided as part of this course are licensed under the Creative Commons CC BY license. Other content, such as text, images, graphics, and audio and video clips (collectively, the “Content”), is protected by copyright law. In some cases, the copyright is owned by third parties, and we are making the third-party content available to you under the fair use doctrine. Fair use permits only certain limited uses of the Content. You may use this Content only for your personal, noncommercial educational and scholarly use.