Course Description: Data Science is the study of the tools and processes used to extract knowledge from data. This course introduces students to this important, interdisciplinary field with applications in business, communication, healthcare, etc. Students learn the basics of data collection, data organization, packaging, and delivery. Simple algorithms and data mining techniques are introduced.
Required Texts and Materials: Doing Data Science. Cathy O’Neil and Rachel Schutt. O’Reilly Media. 2013.
Office Hours: by appointment
In this class, we will learn the data science process through experiencing it twice. First, you'll work through four assignments to create one data analytic. Then, you'll create your own analytic of your choice in your final project. All submitted code should be properly commented and should run immediately when opened. Include readme.txt files for how I should run your code.
Data Collection: Out on 1/19, Due 1/26
Due 1/26 at 11:59pm
Choose one of three RSS data sources that publish data regularly. You will work with this data for all four assignments. Create a cron job that runs either a Bash script or Python code (your choice) to download the RSS feed onto the server for a week. Additionally, create a second script that downloads data for keywords of your choice from Twitter once a day. Don't forget to remove your cron jobs after a week! The scripts must automatically download data at least once a day, and the RSS script should run as often as the RSS feed is updated. Save each file with the date and the type of data in its name. Submit the code for both scripts and at least three data files for each data type as a zip file with your name.
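As a rough illustration of the download script, here is a minimal Python sketch. The function names, the output directory, the file-name pattern, and the idea of passing the feed URL on the command line are all assumptions made for this example, not course requirements.

```python
import sys
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen


def dated_filename(kind, when, ext="xml"):
    """Build a file name containing the data type and a timestamp."""
    return f"{kind}_{when:%Y-%m-%d_%H%M}.{ext}"


def save_feed(url, out_dir="data", kind="rss"):
    """Download the feed once and write it to a dated file; return the path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / dated_filename(kind, datetime.now())
    with urlopen(url, timeout=30) as response:
        path.write_bytes(response.read())
    return path


# Only download when a feed URL is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(save_feed(sys.argv[1]))
```

A crontab entry such as `0 6 * * * python3 /home/you/download_feed.py https://example.com/feed.rss` would run it daily at 6:00; the script path and URL there are placeholders.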
Exploratory Data Analysis and Cleaning: Out on 1/26, Due 2/2
Due 2/16 at 11:59pm
For this assignment, you will need to parse and clean your data and compute statistics across your files (at least 3 files). First, load and parse your files and choose three features or parameters that you think are interesting. Create a single new CSV file that contains a row for each entry (e.g., one hour, day, article, etc.) and a column for each feature across all your files (e.g., each location, word, etc.). Be sure to convert the data to the correct data types (numbers should not be strings, etc.). In a text file, describe your decisions about what constitutes a row, the data types that exist in your data, and whether they needed to be converted to other types for your columns.
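One possible way to handle the type conversion is sketched below. The rows, column names, and converters are invented for illustration; your own data will have different fields.

```python
import csv
import io

# Hypothetical parsed rows; every value starts out as a string.
RAW = [
    {"date": "2018-01-19", "location": "Boston", "temp_f": "31.5", "alerts": "2"},
    {"date": "2018-01-20", "location": "Boston", "temp_f": "n/a", "alerts": "0"},
]

# Column name -> converter. These columns are assumptions for this example.
CONVERTERS = {"date": str, "location": str, "temp_f": float, "alerts": int}


def convert_row(row):
    """Convert each field to its target type; unparseable values become None."""
    typed = {}
    for name, convert in CONVERTERS.items():
        try:
            typed[name] = convert(row[name])
        except (KeyError, ValueError):
            typed[name] = None
    return typed


def write_csv(rows, handle):
    """Write typed rows as CSV with a header; None becomes an empty cell."""
    writer = csv.DictWriter(handle, fieldnames=list(CONVERTERS))
    writer.writeheader()
    writer.writerows(rows)


typed_rows = [convert_row(r) for r in RAW]
buffer = io.StringIO()  # stands in for open("merged.csv", "w", newline="")
write_csv(typed_rows, buffer)
```

Keeping the converters in one dictionary also gives you a ready-made list of column types to describe in your text file.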
Next, look at each column of the data. Are there outliers that should be removed? Are there strings in strange formats? Answer these questions in your text file, fix the actual data if possible, and describe your steps to fix your data appropriately. Then, compute and write up an exploratory data analysis of the columns including the means, ranges, and standard deviations of the data and/or frequency counts of non-numeric data (places, common words, etc).
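The summary statistics and a simple outlier check could look something like the sketch below. The sample values are made up, and the 1.5 x IQR rule is only one common convention for flagging outliers, not a method prescribed by the course.

```python
import statistics
from collections import Counter


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (one common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def summarize_numeric(values):
    """Return the mean, range, and sample standard deviation of a column."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


# Hypothetical columns for illustration.
temps = [10, 12, 11, 13, 12, 95]  # 95 looks like a sensor or parsing error
places = ["Boston", "Boston", "Salem"]

cleaned = drop_outliers(temps)
stats = summarize_numeric(cleaned)
place_counts = Counter(places)  # frequency counts for non-numeric data
```

Whatever rule you use, record in your writeup which values were removed and why.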
Submit at least 4 files in a tar: a Jupyter notebook with the code for merging, cleaning, and analyzing your data; your original files; your new file; and your writeup of what your data contains, your procedure for cleaning it, and your final analysis results.
Data Visualization: Out 2/2, Due 2/16
Due 2/23 at 11:59pm
Data is never perfect when it is downloaded. You will need to transform your data into a format that is easier to analyze. Write code to produce 3 new features of your data based on what you've learned about your domain (weather, earthquakes, trending news). Write up a readme file describing what each new field represents. Make sure that at least one of your features combines two or more of the original fields and at least one of your features analyzes a textual field in the original file. These features should represent the independent variables we will use in your next assignment to perform your data analysis. Then visualize your new features. Use appropriate plots, including histograms, time series, or scatter plots, for each new feature and for two pairs of features. Submit your Jupyter notebook code to create new features and plots, your old data file and your new data file, and your readme of what the features are.
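To make the feature requirements concrete, here is a small sketch that derives three features from weather-style rows. The field names and values are hypothetical; the point is only the shape of the code: one feature combines two original fields and one analyzes a text field.

```python
from datetime import datetime

# Hypothetical cleaned rows; the field names are assumptions for illustration.
rows = [
    {"time": "2018-02-05T14:30", "high_f": 41.0, "low_f": 28.0,
     "summary": "Light snow expected overnight"},
    {"time": "2018-02-06T09:00", "high_f": 35.0, "low_f": 30.0,
     "summary": "Clear"},
]


def add_features(row):
    """Return a copy of the row with three derived features added."""
    new = dict(row)
    # Combines two original fields: the daily temperature spread.
    new["temp_range_f"] = row["high_f"] - row["low_f"]
    # Analyzes a textual field: the number of words in the summary.
    new["summary_words"] = len(row["summary"].split())
    # Derived from the timestamp: hour of day (0-23).
    new["hour"] = datetime.strptime(row["time"], "%Y-%m-%dT%H:%M").hour
    return new


featured = [add_features(r) for r in rows]
```

Each new column can then be handed to the plotting library of your choice for the histograms and scatter plots.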
Data Analysis: Out 2/23, Due 3/2
How does so much mail get to your house so quickly? The post office collected thousands of samples of handwriting from people all over the country and uses a classifier to determine what zip code you wrote on your letter. In this assignment, you will build a classifier for this data yourself. Upload this training and testing dataset to your server space.
To load the data:
import numpy as np

with open("filename.txt") as f:
    f.readline()  # skip the header
    data = np.loadtxt(f)
X = data[:, 1:]  # features are columns 1 through end
y = data[:, 0]  # labels are column 0
Then, instantiate a Naive Bayes classifier (GaussianNB) and train it using your training data. If you want to see how the training would differ, you can split the training file into training and validation sets by using the first N lines and the rest of the lines, respectively. Using the testing data, test your classifier's accuracy. The final line of your notebook should return the accuracy of your classifier on the testing data. Then, write a text (non-code) block that explains what the accuracy means (i.e., does it get a lot of the handwriting wrong?) and which other classifier you could use and why. Submit only your Jupyter notebook to the website.
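The training, splitting, and scoring steps for this assignment can be sketched as follows. Because the course files are not reproduced here, the example generates synthetic data with the same layout (label in column 0, features after it); the data, the value of N, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the course files (an assumption for illustration):
# column 0 is the label, the remaining columns are features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
features = rng.normal(loc=labels[:, None] * 5.0, scale=1.0, size=(300, 4))
data = np.column_stack([labels, features])

# Split by position: first N rows for training, the rest for validation.
N = 200
X_train, y_train = data[:N, 1:], data[:N, 0]
X_val, y_val = data[N:, 1:], data[N:, 0]

clf = GaussianNB()
clf.fit(X_train, y_train)

# score() returns the fraction of rows whose predicted label is correct.
accuracy = clf.score(X_val, y_val)
accuracy
```

In a notebook, the bare `accuracy` on the last line is what makes the cell display the value. For the assignment itself, you would score against the separate testing file rather than a slice of the training data.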
Final Project: Out 3/2, Due 4/16
You have now learned enough concepts that you should be able to create an analytic of your own. You can use the data that you already have, use some of my data, download data from Kaggle or other sources, or collect something new. You'll write a project proposal detailing your data and expected analytics and visualizations, then perform exploratory data analysis and data cleansing, and then compute your data analytics. The culmination of the project will be both a final paper and a presentation to the class.
Task 1 - Proposal: Due 3/19. Please write a 3 paragraph proposal detailing the project you would like to create. It should include how you will find the data if you need any new data. It should include a description of the task you would like to do and your hypothesis about the data. And it should include a rough description of the algorithms you will need to write. If you need help thinking of a project, please ask. I can supply you with project ideas. UPLOAD IT TO MOODLE.
Task 2 - Mid-Point Check In 4/4-4/6. Bring code to class 4/4 to discuss with me. On 4/6, please submit a 2 page writeup detailing what you have accomplished, your algorithm design decisions, what you have remaining to do, and what you need help with. Submit writeup on Moodle (more information about rubric also there).
Task 3 - Final Presentation: Due 4/16-4/18. Please prepare a 10 minute presentation including an introduction to the problem your code is trying to solve and your hypothesis, the data you used and how it was collected and cleaned (if you used any), the algorithms you wrote and why you made the design decisions you did, the challenges you faced while writing the code, and lessons learned. Include a slide on the potential bias and/or ethical concerns about your data or analytic. UPLOAD IT TO MOODLE.
Task 4 - Final Code: Due 4/20. You should submit your final code, any data files that are needed to run it, documentation on how to run it, and commented code so that I can read it. Any place where you think I might wonder why you wrote code the way you did, please add extra comments.
Task 5 - Final Paper on Ethics due 4/20. Write 2 pages about any bias or ethical concerns in your data or your analytic or in a pop culture topic related to bias and ethics. More details on Moodle.
1/1: No Class
1/3: What is Data Science?
1/5: Applications of Data Science
1/8: Research Methods
1/10: Research Methods
1/12: Data Science Process
1/15: No Class
1/17: Downloading Data
Assignment 1 out
1/22: Data Parsing
1/24: Data Conversion
1/26: Jupyter Notebooks
Assignment 1 due
Assignment 2 out
1/29: Data Cleaning
1/31: Data Summarization
2/2: Combining Features
Assignment 2 due
Assignment 3 out
2/5: Principles of Viz
2/7: Counts and Histograms
2/9: Tables, Charts, Graphs
2/12: Bayes Rule
2/14: Bayes and Analytics
2/16: Python SciKit-Learn
Assignment 3 due
Assignment 4 out
2/19: No Class
2/21: Machine Learning Classification
2/23: Machine Learning Regression
2/26: More Classification
2/28: More Classification
Assignment 4 due
3/5: Spring Break
3/7: Spring Break
3/9: Spring Break
3/12: Midterm Review
3/16: Recommender Systems and Text Analysis
3/19: Boosting, Proposal due
3/21: Neural Networks, Graphs
3/23: Big Data
3/26: Artificial Intelligence
3/28: Artificial Intelligence
3/30: Work in Class
4/6: Work in Class
4/20: Final Thoughts