MA 415/615 Spring 2022

MA 415/615 Data Science in R

Introduction to R, the computer language written by and for statisticians. Emphasis on data exploration, statistical analysis, problem solving, reproducibility, and multimedia delivery. Effective Fall 2020, this course fulfills a single unit in the following BU Hub area: Critical Thinking.

Notes

The course is run as a flipped classroom: pre-recorded lectures are posted for you to watch at home, and class time focuses on examples, case studies, and group activities. A core part of the course is a semester-long group project in which you will analyze data sets of your choice and build a website using Blogdown to illustrate your findings.

Tentative Syllabus

Logistics

  • Lectures MWF 2:30 PM - 3:20 PM
    • PHO 201
  • Discussion Section
    • Section A2 M 3:35 PM - 4:25 PM
    • Section A3 M 4:40 PM - 5:30 PM
    • Section A4 Th 11:15 AM - 12:05 PM

All times are Eastern Time.

Instructors

Prof. Daniel Sussman

Office hours:

  • TBA

TF TBA

Description

Data Science can be described as a more modern spin on Statistics, more concerned with the analysis of large, varied, and rich datasets. For this reason, it has gathered interest from many domain-specific disciplines and acquired a stronger computational focus. The goal of this course is to provide an introduction to Data Science that covers the most typical analyses: data wrangling (cleaning and tidying), exploratory data analysis (EDA), data visualization, modelling, and, finally, effective communication of results.

Typical Data Science process, from data entry to final product.

To perform these tasks, we will be using R, a programming language specially geared to data analysis, with a broad collection of specialized packages (add-ins), and its most popular interface, RStudio. Because students develop many written products, including code and documentation, we also cover a version control system called Git.

R is open source and is available from the R Project website. R’s official package repository is CRAN. RStudio is developed by its namesake company and has an open source version; check for a free version at RStudio.com.
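As a rough illustration of the steps named in the description (wrangling, exploratory analysis, visualization, modelling), here is a minimal sketch in base R. It uses the built-in `mtcars` dataset so that it runs without installing anything. The choice of dataset, the `transmission` column, and the simple linear model are assumptions made for illustration only; they are not course material, and the course itself relies on tidyverse packages rather than base R.

```r
# Built-in example data; no packages required.
cars <- mtcars

# Wrangling: recode the numeric transmission flag (0/1) as a labelled factor.
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Exploratory data analysis: mean fuel economy per transmission type.
mpg_by_transmission <- aggregate(mpg ~ transmission, data = cars, FUN = mean)
print(mpg_by_transmission)

# Visualization: fuel economy against weight, coloured by transmission.
plot(mpg ~ wt, data = cars, col = cars$transmission,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Modelling: a simple linear regression of fuel economy on weight.
fit <- lm(mpg ~ wt, data = cars)
print(coef(fit))
```

Communicating the result would then mean wrapping output like this in a report or web page, which is the role Blogdown plays in the group project.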

This course promotes a hands-on approach where students practice concepts during lecture and lab sessions, and work on the delivery of finalized products covering data analysis, presentations, and reproducible research. Students are evaluated based on homework assignments, blog posts, and a final project.

Prerequisites

Basic programming (at the level of CS 111) and basic statistics and probability (at the level of MA 113/115/213).

Syllabus

| Week | Lecture | Lab |
|------|---------|-----|
| 1 | Introduction [R1,2] | Intro and setup |
| 2 | Data visualization [R3] | Workflow: scripts and projects [R6,8] |
| 3 | Data transformation [R4,5] | Data visualization |
| 4 | Exploratory data analysis [R7] | Data transformation |
| 5 | Data import [R10,11] | Git |
| 6 | Tidy and relational data [R12,13] | EDA 1 |
| 7 | Strings and factors [R14,15] | Tidying data |
| 8 | Factors and dates [R15,16] | EDA 2 |
| 9 | Functions, vectors and iterators [R19–21] | Project work |
| 10 | Modelling: linear and logistic regressions | Project work |
| 11 | Interactive communication | Data modelling |
| 12 | Project consultancy | Project work |
| 13 | Spatial applications | Interactive comm. with Shiny |
| 14 | Web scraping and databases | Project work |
| 15 | Project sharing | Project work |

Table: Tentative class schedule. References within [] are chapter readings from R4DS.

The course is roughly divided into four parts: weeks 1–4 cover data exploration, weeks 5–8 are about data wrangling, weeks 9–12 discuss programming and modelling, and weeks 13–15 cover special topics and project presentations.

Textbook

We adopt the following reference for most of the course, as can be seen from the readings in the syllabus:

  • Hadley Wickham and Garrett Grolemund. R for Data Science (R4DS). O’Reilly Media. First edition. ISBN: 978-1491910399.^[Hadley Wickham is one of the most active R contributors, having co-authored the tidyverse, a collection of modern R packages we will be using extensively in this course.]

The later topics in the course are covered from class notes and selected readings. The textbook points to many other references, ranging from basic concepts to advanced methods. We also recommend the (short) exercises in R4DS as a way to practice the concepts in smaller chunks. Moreover, the R community is very active and provides documentation, working examples of package applications, and online discussions.


R4DS is available as an online reference.

Course Evaluation

This course is very hands-on, so evaluation will be based on homework assignments, blog posts, class activities, and a final project. Homework assignments are individual; they follow on from classroom work and focus on coding, data analysis, and documentation explaining all the work. In-class activities give students the practical experience needed for real-world data analyses and, in particular, for the course’s final project.

For homework assignments and lab reports we provide templates for students to fill in as they learn to write code and documentation and to use version control. However, less scaffolding is offered as the semester progresses!

The final project integrates the main topics in the course and consists of a more significant, realistic data analysis project. It should be comprehensive, covering data scraping, wrangling, visualization, and summarization, and ending in a consistent deliverable. In summary, the project should answer a relevant question of interest through data storytelling.

Students will tackle in-class activities and the project in teams, to exercise collaborative work. Teams will consist of 4 to 5 participants. All team members are expected to contribute equally to the completion of each lab report and project. Each team presents their final product in the last week of classes.

Grading

Homework assignments represent 30% of the final grade, lab reports account for a further 30%, and the final project covers the remaining 40% of the course grade^[$G = 0.3 \times H + 0.3 \times L + 0.4 \times P$]. For each student, the homework and the lab report with the lowest scores will be dropped when computing the homework and lab grades.
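The weighting above can be sketched as a short base R calculation. This is only an illustration: it assumes that the homework grade H and the lab grade L are plain averages of the remaining scores after the lowest one is dropped, and that all scores are on a 0–100 scale. Neither detail is stated in this syllabus, and the function names and scores below are hypothetical.

```r
# Average a vector of scores after dropping the single lowest one.
# Assumption: the component grade is a plain mean of the remaining scores.
drop_lowest_mean <- function(scores) {
  stopifnot(length(scores) >= 2)
  mean(sort(scores)[-1])
}

# Combine the components with the syllabus weights: G = 0.3 H + 0.3 L + 0.4 P.
course_grade <- function(homework, labs, project) {
  H <- drop_lowest_mean(homework)
  L <- drop_lowest_mean(labs)
  0.3 * H + 0.3 * L + 0.4 * project
}

# Hypothetical scores on a 0-100 scale.
homework <- c(90, 70, 80, 100)  # the 70 is dropped, so H = 90
labs     <- c(85, 95, 60)       # the 60 is dropped, so L = 90
project  <- 88

course_grade(homework, labs, project)  # 0.3 * 90 + 0.3 * 90 + 0.4 * 88 = 89.2
```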

Communication

  • Please refer to either Blackboard or Slack for course materials.
  • All lecture recordings will be posted on Blackboard.
  • Grades are on Blackboard but grade feedback (and other feedback) will be on GitHub.
  • Most communication will be over Slack.
    • Please contact me over Slack (rather than email).
    • To keep things organized, make sure to post in the correct channel.
    • If a question is specific to a given lecture or assignment, post it in the corresponding thread.

Policies

Participation

In this course, class time (in both lecture and lab) is designed to be as interactive as possible. The reasoning is that, as with math, one can only really learn programming by practicing, so we provide as many activities as possible. We also encourage participation in class, especially when helping fellow students!

Late work

Late homework assignments and lab reports will be accepted up to 24 hours after their deadline, with a 20% penalty. Homework assignments and lab reports submitted more than 24 hours late will not be accepted, and late final projects will not be accepted at all.
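As a rough sketch of how this policy might translate into a score, the base R function below makes two assumptions that the policy does not spell out: that the penalty removes 20% of the earned score (rather than 20 percentage points), and that work which is not accepted counts as zero. The function name `late_score` is hypothetical.

```r
# Score after applying the late policy to a homework or lab report.
# Assumptions: the penalty is 20% of the earned score, and work that is
# not accepted (more than 24 hours late) is recorded as zero.
late_score <- function(raw_score, hours_late) {
  if (hours_late <= 0) {
    return(raw_score)        # on time: no penalty
  }
  if (hours_late <= 24) {
    return(raw_score * 0.8)  # within 24 hours: 20% penalty
  }
  0                          # later than 24 hours: not accepted
}

late_score(90, 0)   # on time: 90
late_score(90, 5)   # five hours late: 72
late_score(90, 30)  # thirty hours late: 0
```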

Diversity and Inclusiveness

We welcome students from diverse backgrounds, perspectives, experiences, and identities, including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, religion, and culture. We intend to serve all students in this course well, to address learning needs, to present materials that are respectful of diversity, and to foster an inclusive learning environment by sharing the view that diversity brings strengths and benefits.

Academic Conduct

Your conduct in this course, as with all BU courses, is governed by the BU Academic Conduct Code. Graduate students should observe the Graduate School (GRS) Conduct Code. In particular,

  • Students should not share code with other students in individual assignments or with other teams in team assignments. Students are, however, welcome to discuss problems together and ask for advice.
  • Any code or argument based on online resources, including StackOverflow, should be acknowledged and referenced. Otherwise, assignments containing uncited material will be treated as plagiarism^[This tends to be a common temptation, given the wealth of online resources.].
  • Cheating on exams, plagiarism on assignments, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Conduct Code, and will not be tolerated.
Daniel Sussman
Assistant Professor of Mathematics and Statistics

Dan Sussman is an Assistant Professor in the Department of Mathematics and Statistics at Boston University.