# MA 415/615 Fall 2021

MA 415/615 is full for Fall 2021 with a long waitlist. I am not taking any requests to join the course. If you are interested in taking the course in the Spring, see below.

## MA 415/615 Data Science in R

Introduction to R, the computer language written by and for statisticians. Emphasis on data exploration, statistical analysis, problem solving, reproducibility, and multimedia delivery. Effective Fall 2020, this course fulfills a single unit in the following BU Hub area: Critical Thinking.

#### Notes

This will be my fifth semester teaching this course. As in the previous two semesters, the course will be a flipped classroom: pre-recorded lectures will be posted to watch at home, and in class we will focus on examples, case studies, and group activities. A core part of the course is a semester-long group project in which you will analyze data sets of your choice and build a website using Blogdown to illustrate your findings.

# Syllabus

## Logistics

• Lectures MWF 9:05 AM - 9:55 AM
• PHO 201

All times are Eastern Time.

## Instructors

Prof. Daniel Sussman

Office hours:

• TBA

TF TBA

## Description

Data Science can be described as a more modern spin on Statistics, more concerned with the analysis of large, varied, and rich datasets. For this reason, it has gathered interest from many domain-specific disciplines and acquired a stronger computational focus. The goal of this course is to provide an introduction to Data Science that covers the most typical analyses, including data wrangling (cleaning and tidying), exploratory data analysis (EDA), data visualization, modeling, and, finally, effective communication of results.

To perform these tasks, we will be using R, a programming language specially geared to data analysis, with a broad collection of specialized packages (add-ins), and its most popular interface, RStudio. Because students develop many written products, including code and documentation, we also cover a version control system called Git.

R is open source and can be found at R project. R’s official package repository is CRAN. RStudio has an open source version and is developed by its namesake company; check for a free version at RStudio.com.

This course promotes a hands-on approach where students practice concepts during lecture and lab sessions, and work on the delivery of finalized products covering data analysis, presentations, and reproducible research. Students are evaluated based on homework assignments, lab reports, and a final project.

## Prerequisites

Basic programming (at the level of CS 111) and basic statistics and probability (at the level of MA 113/115/213).

## Syllabus

| Week | Lecture | Lab |
|------|---------|-----|
| 1 | Introduction [R1,2] | Intro and setup |
| 2 | Data visualization [R3] | Workflow: scripts and projects [R6,8] |
| 3 | Data transformation [R4,5] | Data visualization |
| 4 | Exploratory data analysis [R7] | Data transformation |
| 5 | Data import [R10,11] | Git |
| 6 | Tidy and relational data [R12,13] | EDA 1 |
| 7 | Strings and factors [R14,15] | Tidying data |
| 8 | Factors and dates [R15,16] | EDA 2 |
| 9 | Functions, vectors and iterators [R19–21] | Project work |
| 10 | Modelling: linear and logistic regressions | Project work |
| 11 | Interactive communication | Data modelling |
| 12 | Project consultancy | Project work |
| 13 | Spatial applications | Interactive comm. with Shiny |
| 14 | Web scraping and databases | Project work |
| 15 | Project sharing | Project work |

Table: Tentative class schedule. References within [] are chapter readings from R4DS.

The course is roughly divided into four parts: weeks 1–4 cover data exploration, weeks 5–8 are about data wrangling, weeks 9–12 discuss programming and modeling, and weeks 13–15 cover special topics and project presentations.

## Textbook

We adopt the following reference for most of the course, as can be seen from the readings in the syllabus:

• Hadley Wickham^[Hadley is one of the most active R contributors, having co-authored the tidyverse, a collection of modern R packages we will be using extensively in this course.] and Garrett Grolemund. R for Data Science (R4DS). O’Reilly Media. First edition. ISBN: 978-1491910399.

The later topics in the course are covered from class notes and selected readings. The textbook points to many other references, ranging from basic concepts to more advanced methods. We also recommend the (short) exercises in R4DS as a way to practice the concepts in smaller chunks. Moreover, the R community is very active and provides documentation, working examples of package applications, and online discussions.

R4DS is available as an online reference.

## Course Evaluation

This course is very hands-on, so evaluation will be based on homework assignments, blog posts, class activities, and a final project. Homework assignments are individual; they follow on from classroom work and focus on coding, data analysis, and documentation explaining all the work. In-class activities give students the practical experience needed for real-world data analyses and, in particular, for the course’s final project.

For homework assignments and lab reports we provide templates to be filled in as students learn to write code and documentation and to use version control. However, less scaffolding is offered as the semester progresses!

The final project integrates the main topics in the course and consists of a more significant, realistic data analysis project. It should be comprehensive, covering data scraping, wrangling, visualization, and summarization, and ending in a coherent deliverable. In short, the project should answer a relevant question of interest through data storytelling.

Students will tackle in-class activities and the project in teams, to exercise collaborative work. Teams will consist of 4 to 5 participants. All team members are expected to contribute equally to the completion of each lab report and project. Each team presents their final product in the last week of classes.

Homework assignments represent 30% of the final grade, lab reports account for a further 30%, and the final project covers the remaining 40% of the course grade^[$G = 0.3 \times H + 0.3 \times L + 0.4 \times P$]. For each student, the lowest-scoring homework and the lowest-scoring lab report will be dropped when computing the homework and lab grades.
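
As a worked illustration of this weighting, consider the hypothetical scores below on a 0 to 100 scale. The numbers are invented for the example, and it assumes (the syllabus does not specify this) that $H$ and $L$ are the averages of the scores that remain after the lowest one is dropped:

$$
\begin{aligned}
H &= \frac{90 + 95 + 85}{3} = 90 && \text{(homework scores 70, 90, 95, 85; the 70 is dropped)} \\
L &= \frac{80 + 75 + 85}{3} = 80 && \text{(lab scores 60, 80, 75, 85; the 60 is dropped)} \\
P &= 85 \\
G &= 0.3 \times 90 + 0.3 \times 80 + 0.4 \times 85 = 27 + 24 + 34 = 85
\end{aligned}
$$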

## Communication

• Please refer to either Blackboard or Slack for course materials.
• All lecture recordings will be posted on Blackboard.
• Grades are on Blackboard, but grade feedback (and other feedback) will be on GitHub.
• Most communication will be over Slack.
• To keep things organized, make sure to post in the correct channel.
• If a question is specific to a given lecture or assignment, post it in the corresponding thread.

## Policies

### Participation

In this course, class time (in lecture or lab) is designed to be as interactive as possible. The reasoning is that, as with math, one can only really learn programming by practicing, so we provide as many activities as possible. We also encourage participation in class, especially when helping fellow students!

### Late work

Late homework and lab reports will be accepted up to 24 hours after their deadline but will carry a 20% penalty. Homework and lab reports submitted more than 24 hours late, and late final projects, will not be accepted.

### Diversity and Inclusiveness

We welcome students from diverse backgrounds, perspectives, experiences, and identities, including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, religion, and culture. We intend to serve all students in this course well, to address learning needs, to present materials that are respectful of diversity, and to foster an inclusive learning environment by sharing the view that diversity brings strengths and benefits.