About DSA150

Course Description: Data Science is the study of the tools and processes used to extract knowledge from data. This course introduces students to this important, interdisciplinary field with applications in business, communication, healthcare, etc. Students learn the basics of data collection, data organization, packaging, and delivery. Simple algorithms and data mining techniques are introduced.

Required Texts and Materials: Doing Data Science. Cathy O’Neil and Rachel Schutt. O’Reilly Media. 2013.

Contact Details

Stephanie Rosenthal
Chatham University
Falk 116C
Office Hours: by appointment



In this class, we will learn the data science process by going through it twice. First, you'll work through four assignments to create one data analytic. Then, you'll create an analytic of your own choosing in your final project. All submitted code should be properly commented and should run immediately when opened. Include a readme.txt file explaining how to run your code.

Assignment 1

Data Collection: Out on 1/19, Due 1/26

Due 1/26 at 11:59pm. Choose one of three RSS data sources that publish data regularly. You will work with this data for all four assignments. Create a cron job that runs either a Bash script or Python code (your choice) to download the RSS feed onto the server for a week. Additionally, create a second script that downloads data for keywords of your choice from Twitter once a day. Don't forget to remove your cron jobs after a week! Both scripts must download data automatically at least once a day, and the RSS script should run as often as the RSS feed is updated. Each file name should include the date and the type of data the file contains. Submit the code for both scripts and at least three data files for each data type as a zip file with your name.
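As a rough illustration of the download step, the Python sketch below builds a file name that carries both the data type and the date, then saves one copy of a feed. The helper names `dated_filename` and `save_feed`, the example URL, the output directory, and the crontab line are all placeholders for illustration, not part of the assignment.

```python
import datetime
import pathlib
import urllib.request


def dated_filename(kind, when, ext="xml"):
    """Build a name made of the data type plus a timestamp."""
    return f"{kind}_{when:%Y-%m-%d_%H%M}.{ext}"


def save_feed(url, out_dir, kind="rss"):
    """Download `url` once and store the raw bytes under a dated file name."""
    out_path = pathlib.Path(out_dir) / dated_filename(kind, datetime.datetime.now())
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=30) as response:
        out_path.write_bytes(response.read())
    return out_path


# Hypothetical crontab entry that runs a script every day at 06:00:
#   0 6 * * * /usr/bin/python3 /home/you/download_rss.py
# Inside that script you would call, with your own feed URL:
#   save_feed("https://example.com/feed.xml", "data")
```

Including the hour and minute in the name keeps files from overwriting each other if the feed is fetched more than once a day.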


Assignment 2

Exploratory Data Analysis and Cleaning: Out on 1/26, Due 2/2

Due 2/16 at 11:59pm. For this assignment, you will need to parse and clean your data and compute statistics across your files (at least 3 files). First, load and parse your files and choose three features or parameters that you think are interesting. Create a single new CSV file that contains a row for each entry (e.g., one hour, day, article, etc.) and a column for each feature across all your files (e.g., each location, word, etc.). Be sure to convert the data to the correct data types (numbers should not be stored as strings, etc.). In a text file, describe your decisions about what counts as a row, the data types that exist in your data, and whether they needed to be converted to other types for your columns.
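The type conversion and CSV step might look roughly like the sketch below. The column names and sample values are made up, and `to_number` and `write_rows` are hypothetical helpers; your own rows and columns depend on the feed you chose.

```python
import csv
import io


def to_number(text):
    """Return an int or float parsed from `text`, or None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return None


def write_rows(rows, fieldnames, handle):
    """Write a list of dicts as CSV with one header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up entries standing in for parsed feed items.
raw = [
    {"date": "2020-01-19", "magnitude": " 4.5 ", "depth_km": "10"},
    {"date": "2020-01-20", "magnitude": "n/a", "depth_km": "1,200"},
]
rows = [
    {
        "date": r["date"],
        "magnitude": to_number(r["magnitude"]),
        "depth_km": to_number(r["depth_km"]),
    }
    for r in raw
]

# An in-memory buffer is used here; a real script would open a .csv file.
buffer = io.StringIO()
write_rows(rows, ["date", "magnitude", "depth_km"], buffer)
print(buffer.getvalue())
```

Values that cannot be parsed become None and are written as empty cells, which makes missing data easy to spot in the cleaning step.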

Next, look at each column of the data. Are there outliers that should be removed? Are there strings in strange formats? Answer these questions in your text file, fix the actual data if possible, and describe your steps to fix your data appropriately. Then, compute and write up an exploratory data analysis of the columns including the means, ranges, and standard deviations of the data and/or frequency counts of non-numeric data (places, common words, etc).
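A minimal sketch of these column checks, using only the Python standard library, is shown below. The sample values are invented, the function names are hypothetical, and the 1.5 × IQR rule is just one common rule of thumb for flagging outliers, not a required method.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Return count, mean, min, max, and sample standard deviation, skipping None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in present if v < low or v > high]


# Made-up column values for illustration.
magnitudes = [4.5, 4.7, None, 5.1, 4.9, 4.6, 5.0, 4.8, 12.0]
places = ["Alaska", "Chile", "Alaska", "Japan", "Alaska"]

print(summarize_numeric(magnitudes))
print(flag_outliers(magnitudes))
print(Counter(places).most_common(2))
```

Whether a flagged value should actually be removed is a judgment call to explain in your writeup; a large value may be a real event rather than an error.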

Submit at least four files in a tar archive: a Jupyter Notebook with the code for merging, cleaning, and analyzing your data; your original files; your new file; and a writeup of what your data contains, your procedure for cleaning it, and your final analysis results.


Assignment 3

Data Visualization: Out 2/2, Due 2/16

Due 2/23 at 11:59pm. Data is never perfect when it is downloaded. You will need to transform your data into a format that is easier to analyze. Write code that produces 3 new features of your data based on what you've learned about your domain (weather, earthquakes, trending news). Write up a readme file describing what each new field represents. Make sure that at least one of your features combines two or more of the original fields and at least one of your features analyzes a textual field in the original file. These features should represent the independent variables we will use in your next assignment to perform your data analysis. Then visualize your new features. Use appropriate plots, including histograms, time series, or scatter plots, for each new feature and for two pairs of features. Submit your Jupyter Notebook with the code that creates the new features and plots, your old data file and your new data file, and your readme describing the features.
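To make the two feature requirements concrete, here is a small Python sketch that derives one feature from two numeric fields and two features from a text field. The rows, the column names, and the `add_features` helper are invented for illustration; your features should come from your own domain. Plotting is left out so the sketch runs without extra libraries.

```python
import collections

# Made-up cleaned rows; the column names are placeholders.
rows = [
    {"title": "Strong quake shakes coast", "magnitude": 6.1, "depth_km": 10.0},
    {"title": "Minor tremor", "magnitude": 3.2, "depth_km": 40.0},
    {"title": "Quake felt across region, no damage", "magnitude": 4.8, "depth_km": 8.0},
]


def add_features(row):
    """Return a copy of `row` with three derived columns."""
    words = row["title"].lower().split()
    new = dict(row)
    # Combines two original numeric fields.
    new["mag_per_km"] = row["magnitude"] / row["depth_km"]
    # Both of these are derived from the text field.
    new["title_words"] = len(words)
    new["mentions_quake"] = int("quake" in words)
    return new


featured = [add_features(r) for r in rows]

# Bin one feature into counts, the raw material for a histogram.
histogram = collections.Counter(f["title_words"] for f in featured)
print(sorted(histogram.items()))
```

The binned counts can then be handed to whichever plotting library you use in your notebook.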


Assignment 4

Data Analysis: Out 2/23, Due 3/2

How does so much mail get to your house so quickly? The post office collected thousands of handwriting samples from people all over the country and uses a classifier to determine what zip code you wrote on your letter. In this assignment, you will build a classifier for this data yourself. Upload this training and testing dataset to your server space. Using the training data, train a Naive Bayes classifier. Using the testing data, report the accuracy of your classifier.
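The train-then-score workflow can be sketched in plain Python as below. This is a from-scratch Gaussian Naive Bayes on toy two-feature data, not the handwriting dataset and not necessarily the variant expected for the assignment; the function names are hypothetical. scikit-learn, which the class schedule covers, offers a library version in its `GaussianNB` class.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels):
    """Estimate per-class log priors and per-feature means and variances."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, xs in by_class.items():
        n = len(xs)
        means = [sum(col) / n for col in zip(*xs)]
        # A small constant keeps variances positive (an assumption of this sketch).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*xs), means)
        ]
        model[y] = (math.log(n / len(samples)), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=lambda y: log_posterior(model[y]))


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


# Toy two-feature data, not the course dataset.
train_x = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = ["a", "a", "a", "b", "b", "b"]
test_x = [[1.1, 0.9], [2.9, 3.1]]
test_y = ["a", "b"]

model = train_gaussian_nb(train_x, train_y)
print(accuracy(model, test_x, test_y))
```

The key point is that the model is fitted only on the training data and scored only on the testing data, so the accuracy reflects examples the classifier has not seen.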



Final Project

Out 3/12, Due 4/16





1/1: No Class

1/3: What is Data Science?

1/5: Applications of Data Science

1/8: Research Methods

1/10: Research Methods

1/12: Data Science Process

1/15: No Class

1/17: Downloading Data

1/19: Daemons
Assignment 1 out

1/22: Data Parsing

1/24: Data Conversion

1/26: Jupyter Notebooks
Assignment 1 due
Assignment 2 out

1/29: Data Cleaning

1/31: Data Summarization

2/2: Combining Features
Assignment 2 due
Assignment 3 out

2/5: Principles of Viz

2/7: Counts and Histograms

2/9: Tables, Charts, Graphs

2/12: Bayes Rule

2/14: Bayes and Analytics

2/16: Python SciKit-Learn
Assignment 3 due
Assignment 4 out

2/19: No Class

2/21: Machine Learning Classification

2/23: Machine Learning Regression

2/26: More Classification

2/28: More Classification

3/2: More Regression
Assignment 4 due
Project out

3/5: Spring Break

3/7: Spring Break

3/9: Spring Break

3/12: Midterm

3/14: Neural Networks and Graphs

3/16: Recommender Systems and Text Analysis
Proposal due

3/19: Boosting

3/21: Clustering

3/23: Big Data

3/26: Artificial Intelligence

3/28: Artificial Intelligence

3/30: Work in Class
Midpoint due

4/2: Security

4/4: Ethics

4/6: Work in Class

4/9: Speaker

4/11: Speaker

4/13: Final Exam Review

4/16: Presentations

4/18: Presentations

4/20: Presentations

Final Exam
