MA 415/615 Fall 2021
MA 415/615 Data Science in R
Introduction to R, the computer language written by and for statisticians. Emphasis on data exploration, statistical analysis, problem solving, reproducibility, and multimedia delivery. Effective Fall 2020, this course fulfills a single unit in the following BU Hub area: Critical Thinking.
This will be my fifth semester teaching this course. As in the previous two semesters, the course will be a flipped classroom: pre-recorded lectures will be posted for you to watch at home, and in class we will focus on examples, case studies, and group activities. A core part of the course is a semester-long group project in which you will analyze data sets of your choice and build a website using Blogdown to illustrate your findings.
- Lectures MWF 9:05 AM - 9:55 AM
- PHO 201
All times are Eastern Time.
Prof. Daniel Sussman
Data Science can be described as a more modern spin on Statistics, more concerned with the analysis of large, varied, and rich datasets. For this reason, it has gathered interest from many domain-specific disciplines and acquired a stronger computational focus. The goal of this course is to provide an introduction to Data Science that covers the most typical analyses, including data wrangling (cleaning and tidying), exploratory data analysis (EDA), data visualization, modelling, and, finally, effective communication of results.
To perform these tasks, we will be using R, a programming language specially geared to data analysis, with a broad collection of specialized packages (add-ins), and its most popular interface, RStudio. Because students develop many written products, including code and documentation, we also cover a version control system called Git.
R is open source and is available from the R Project website. R’s official package repository is CRAN. RStudio has an open source version and is developed by its namesake company; check for a free version at RStudio.com.
This course promotes a hands-on approach where students practice concepts during lecture and lab sessions, and work on the delivery of finalized products covering data analysis, presentations, and reproducible research. Students are evaluated based on homework assignments, lab reports, and a final project.
Basic programming (at the level of CS 111) and basic statistics and probability (at the level of MA 113/115/213).
| Topic | R4DS readings |
|:--|:--|
| Intro and setup | |
| Data visualization | R3 |
| Workflow: scripts and projects | R6, R8 |
| Data transformation | R4, R5 |
| Exploratory data analysis | R7 |
| Data import | R10, R11 |
| Tidy and relational data | R12, R13 |
| Strings and factors | R14, R15 |
| Factors and dates | R15, R16 |
| Functions, vectors and iterators | R19–R21 |
| Modelling: linear and logistic regressions | |
| Interactive comm. with Shiny | |
| Web scraping and databases | |

Table: Tentative class schedule. The readings column lists chapters from R4DS.
The course is roughly divided into four parts: weeks 1–4 cover data exploration, weeks 5–8 are about data wrangling, in weeks 9–12 we discuss programming and modeling, and weeks 13–15 have special topics and project presentations.
We adopt the following reference for most of the course, as can be seen from the readings in the syllabus:
- Garrett Grolemund and Hadley Wickham^[Hadley is one of the most active R contributors, having co-authored the tidyverse, a collection of modern R packages we will be using extensively in this course.]. R for Data Science (R4DS). O’Reilly Media. First edition. ISBN: 978-1491910399.
The later topics in the course are covered from class notes and selected readings. The textbook points to many other references, ranging from basic concepts to more advanced methods. We also recommend the (short) exercises in R4DS as a way to practice the concepts in smaller chunks. Moreover, the R community is very active and provides documentation, working examples of package applications, and online discussions.
R4DS is available as an online reference.
This course is very hands-on, so evaluation will be based on homework assignments, blog posts, class activities, and a final project. Homework assignments are individual, follow on from classroom work, and focus on coding, data analysis, and documentation explaining all the work. In-class activities give students the practical experience needed for real-world data analyses and, in particular, for the course’s final project.
For homework assignments and lab reports we provide templates that students fill in as they learn to write code and documentation and to use version control. However, less scaffolding is offered as the semester progresses!
The final project integrates the main topics in the course and consists of a more significant, realistic data analysis project. It should be comprehensive, running from data scraping, wrangling, visualizing, and summarizing through to a coherent deliverable. In summary, the project should answer a relevant question of interest with data storytelling.
Students will tackle in-class activities and the project in teams, to exercise collaborative work. Teams will consist of 4 to 5 participants. All team members are expected to contribute equally to the completion of each lab report and project. Each team presents their final product in the last week of classes.
Homework assignments represent 30% of the final grade, lab reports account for a further 30%, and the final project covers the remaining 40% of the course grade^[$G = 0.3 \times H + 0.3 \times L + 0.4 \times P$]. For each student, the lowest-scoring homework and the lowest-scoring lab report will be dropped when computing the homework and lab grades.
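As a worked illustration of this weighting, suppose (hypothetically) that after the lowest scores are dropped a student has a homework grade of $H = 85$, a lab grade of $L = 90$, and a project grade of $P = 80$, all on a 0 to 100 scale. These numbers are made up for illustration only, and how the remaining scores are combined into $H$ and $L$ is not specified here.

$$
G = 0.3 \times 85 + 0.3 \times 90 + 0.4 \times 80 = 25.5 + 27 + 32 = 84.5
$$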
- Please refer to either Blackboard or Slack for course materials.
- All lecture recordings will be posted on Blackboard.
- Grades are on Blackboard, but grade feedback (and other feedback) will be on GitHub.
- Most communication will be over Slack.
- Please contact me over Slack (rather than email).
- To keep things organized, make sure to post in the correct channel.
- Also, post in the thread corresponding to a given lecture or assignment if the question is specific to that lecture or assignment.
In this course, class time (in both lecture and lab) is designed to be as interactive as possible. The reasoning is that, as with Math, one can only really learn programming by practicing, so we provide as many activities as possible. We also encourage participation in class, especially when helping fellow students!
Late homework and lab reports will be accepted only up to 24 hours after their deadline and will carry a 20% penalty. Homework and lab reports submitted later than that, and late final projects, will not be accepted.
Diversity and Inclusiveness
We welcome students from diverse backgrounds, perspectives, experiences, and identities—including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, religion, and culture. We intend to serve well all students in this course, to address learning needs, to present materials that are respectful of diversity, and to foster an inclusive learning environment by sharing the view that diversity brings strengths and benefits.
- Students should not share code with other students in individual assignments or with other teams in team assignments. Students are, however, welcome to discuss problems together and ask for advice.
- Any code or argument based on online resources, including StackOverflow, should be acknowledged and referenced. Otherwise, assignments containing uncited material will be treated as plagiarism^[This tends to be a common temptation, given the wealth of online resources.].
- Cheating on exams, plagiarism on assignments, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Conduct Code, and will not be tolerated.