Practicums

The practicum is a required, central part of our curriculum that provides students with the opportunity to gain real-world experience working with our industry partners. Each project is sponsored by a company, allowing students to gain hands-on analytics experience and reconcile mathematical theory with business practice. Student groups — supervised by an MSDS faculty member — work with the practicum company to identify, define, scope, and analyze a particular business problem. All groups are additionally supported and supervised by the MSDS practicum director.

Following an initial hypothesis, students typically engage in data acquisition, exploratory data analysis, feature extraction, model development and evaluation, as well as oral and written communication of results. Class schedules are set so that students can work onsite one to two days per week.

  • Practicums begin in mid-October.
  • Students devote 15 hours a week to practicum on average.
  • Projects may be paid or unpaid.

Select Practicum Partners
Airbnb
AT&T
BART
Clorox
Dictionary.com
Eventbrite
First Republic Bank
Flyr
Houston Astros
Kiva
Metromile
Lawfty
LootCrate
Mozilla
Nextdoor
Oakland Athletics
Orange
PayPal
Recology
SEGA
SF 49ers
The Coca-Cola Company
Turo
UCSF
Vida Health
Williams-Sonoma
Zipcar

Past Projects

2017

Aki Technologies

Our Team: Arda Aysu, Joshua Amunrud

Goal: Predict complex human activity using mobile device accelerometer and gyroscopic data and detect possible fraud by analyzing impression level data

Joshua and Arda researched several digital signal processing techniques and clustering strategies for time series data. In the end, they used Python to produce a random forest model that predicted activities from the processed data. They also examined one month of impression-level data to identify publishers with unusual characteristics. This is an ongoing project, but tools for further exploration were built in Python.
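
To illustrate the general shape of this kind of approach (a sketch, not the team's actual code), raw accelerometer and gyroscope readings can be windowed into summary features and fed to a random forest; the column names and window size below are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical input: one row per sensor reading, with columns for the three
# accelerometer axes, three gyroscope axes, and an activity label.
def make_windows(df, size=100):
    """Summarize fixed-length windows of raw readings into feature vectors."""
    feats, labels = [], []
    cols = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "gyro_z"]
    for start in range(0, len(df) - size, size):
        window = df.iloc[start:start + size]
        row = []
        for c in cols:
            row += [window[c].mean(), window[c].std(), window[c].min(), window[c].max()]
        feats.append(row)
        labels.append(window["activity"].mode()[0])  # majority label in the window
    return np.array(feats), np.array(labels)

# X, y = make_windows(sensor_df)
# clf = RandomForestClassifier(n_estimators=200, random_state=0)
# print(cross_val_score(clf, X, y, cv=5).mean())
```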

BART

Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton

Goal: Forecast ridership by gathering and analyzing external factors relevant to ridership demand

At BART, many factors come into play regarding how, when, and where people decide to take public transportation. With recent changes in the transportation industry and growing competition, it is more critical than ever to accurately forecast ridership to plan finances into the future. Cameron and Mikaela used R to develop a SARIMAX time series forecasting model incorporating external factors and government data to determine ridership covariates. This modeling algorithm was implemented in a Shiny application to allow a wider audience at BART to take advantage of these forecasts.
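
The team's model was built in R; as a language-neutral illustration of the same idea, a SARIMAX model with exogenous covariates can be fit in Python with statsmodels. The column names, seasonal order, and forecast horizon below are assumptions, not BART's actual specification:

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly ridership series plus external covariates
# (e.g. employment, gas prices) aligned on the same date index:
# ridership = pd.Series(..., index=pd.date_range("2010-01-01", periods=84, freq="MS"))
# exog = pd.DataFrame({"employment": ..., "gas_price": ...}, index=ridership.index)

def fit_forecast(ridership, exog, future_exog, steps=12):
    model = SARIMAX(ridership, exog=exog,
                    order=(1, 1, 1),               # assumed non-seasonal order
                    seasonal_order=(1, 1, 1, 12))  # assumed yearly seasonality
    result = model.fit(disp=False)
    return result.get_forecast(steps=steps, exog=future_exog).predicted_mean
```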

Capital One

Our Team: Nick Levitt, Kyle Kovacevich

Goal: To build predictive models and ETL pipelines for a cyber security project

Nick and Kyle used a combination of advanced machine learning techniques such as Deep Learning, NLP and network analysis to find outliers and patterns in the data. Work was implemented in Python, PostgreSQL and Spark on top of Hadoop and Parquet.

Clorox

Our Team: Rui Li, Elise Song

At Clorox, we were presented with the business challenge of exploring important factors that correlated with short-term sales fluctuations following the outbreak of an event. We scraped more than 28 million news titles and 102k full articles relevant to the event, conducted sentiment analyses on 14 million Tweets and Instagram posts for feature engineering, and found significant results from regression analysis. Our presentation was well received by the product team and the data science team.

Craft

Our Team: Keyang Zhang

Goal: Extracting key sentences and detecting signals from news

Keyang built a named entity recognition algorithm in Python to extract the main company name from news articles. He also used Latent Dirichlet Allocation, Word2vec, and FuzzyWuzzy to perform keyword and key-sentence extraction. Based on the keywords and key sentences, Keyang used Gaussian Naive Bayes to build classifiers to detect signals in each piece of news.

Coca-Cola

Our Team: Dominic Vantman, Justin Midiri

Goal: Create a Python program that aggregates sales, consumer, and syndicated data into a unified information database and performs advanced analytics on promotional performance and return on investment

Dominic and Justin designed data visualization dashboards to further understand financial performance metrics, customer profitability reconciliation, and the promotional trade-spend optimization tactics that yield the highest growth across their robust portfolio of customers, products, and geographical markets.

Convergence Investment Management

Our Team: Linda Liu

Goal: Perform exploratory data analysis on different aspects of the financial market and use regression and classification methods to forecast/identify alpha

Linda leveraged ensemble methods, using stacking and majority voting, to detect rare events in the financial market. She also wrote machine learning models that could be deployed into production.

Dictionary.com

Our Team: Claire Broad

Goal: Identify and rank new or missing words for potential inclusion in the dictionary

Claire developed an autonomous system in Python using scikit-learn to generate a validity score for the items on the monthly list of unmatched queries on the Dictionary.com site. Her algorithm incorporated signals from lexical structure, query patterns, and usage on social media, and used a novel ensembling technique to mitigate the effect of noise in the training set.

Ebates

Our Team: Sheri Nguyen, Keyang Zhang

Goal 1: Build an anomaly detection model that found breakages in affiliate reporting

Sheri and Keyang built an anomaly detection system using a combination of the Bayesian Change Point algorithm and the Twitter Anomaly Detection package by integrating R into their Python programs using RPY2. Their model successfully caught some major partner reporting breakages on various platforms.

Goal 2: Build a daily revenue forecasting model

Sheri and Keyang built a model to predict daily revenue. This was an important implementation for Ebates' marketing team: because the daily data feed from affiliates was often delayed, an accurate revenue prediction model was necessary to make important weekly decisions about where to allocate marketing campaigns. They implemented a few different algorithms, including linear regression and random forest. Their model was put into production with a 5-6% error rate.

Goal 3: Build a model to find which customers should be sent reminder emails to Refer-A-Friend to Ebates

Sheri and Keyang built a model to pick out the most likely customers to refer Ebates to a friend. The Refer-A-Friend program is one of the highest revenue generating programs offered by Ebates. They first filtered out Ebates customers for fraudulent behavior and then assigned probabilities to each customer. The customers with the highest probabilities were then sent an email to remind them about the Refer-A-Friend program offered by Ebates. Sheri and Keyang's model improved Ebates' original classification model performance from 13% true positive detection to 80% true positive detection. Their model is currently being used in production.

Eventbrite

Our Team: Kelsey MacMillan

Goal: Improve site search and recommendation by extracting high-level features from event data

In addition to organizers that Eventbrite has direct relationships with, there are many individual organizers around the world who sign on to Eventbrite by themselves and create events. Sorting through this large inventory of individually organized events to match attendees to their next experience is a challenge. To help address this challenge, Kelsey implemented an unsupervised topic modeling method called Non-Negative Matrix Factorization that extracts key topic "tags" from events using the raw text from their titles and descriptions. For the always-tricky task of validating an unsupervised method, Kelsey built a different type of topic model using a probabilistic approach called Latent Dirichlet Allocation and checked for stability of the found topics across both types of models.
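
As an illustration of this style of pipeline (a sketch over assumed data, not Eventbrite's implementation), TF-IDF features over event titles and descriptions can be factorized with NMF, and the top-weighted terms per component read off as candidate tags:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical event texts (title + description concatenated).
docs = [
    "Intro to pottery: wheel throwing workshop for beginners",
    "Live jazz night with local trio and open jam session",
    "Beginner pottery glazing class, all materials included",
]

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)            # document-topic weights
terms = tfidf.get_feature_names_out()

# The top words in each component serve as candidate "tags".
for k, component in enumerate(nmf.components_):
    top = component.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
```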

Our Team: Hannah Lieber

Goal: Help the sales and marketing teams better understand characteristics of their revenue-generating organizers

Hannah used Hive and Python to develop a random forest model to classify which organizers are likely to churn, allowing Eventbrite to preemptively take action to retain these organizers.

First Republic Bank

Our Team: Yige Liu, Anshika Srivastava

Goal: Deposit diversification study and network analysis

Anshika and Yige studied the volatile segments of the deposits to understand whether an increase in segment balance leads to a moderation in volatility. They also used graph theory to design an algorithm to detect households and identify influential people in the bank’s network using the bank’s historical data. They acquired data from multiple servers and databases using SQL (SQL Server) and performed the analysis using Python's pandas and Matplotlib.

Flyr

Our Team: Graham McAlister, Derek Welborn, Yixin Zhang

Graham built a predictive anomaly detection algorithm by implementing an Isolation Forest in Python. The system takes predictions for search demand of flight characteristics and returns the probability that this point comes from the "normal" distribution in the data set.
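
A rough sketch of how an Isolation Forest scorer of this kind can be set up in scikit-learn (the features and training data below are synthetic placeholders for Flyr's search-demand predictions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical feature matrix: rows are flight-search demand predictions,
# columns are characteristics of the search (all values synthetic here).
X_train = rng.normal(size=(1000, 4))
X_new = np.vstack([rng.normal(size=(5, 4)), [[8, 8, 8, 8]]])  # last row is an outlier

iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
iso.fit(X_train)

# decision_function: higher means more "normal"; predict flags anomalies as -1.
print(iso.decision_function(X_new))
print(iso.predict(X_new))  # +1 normal, -1 anomalous
```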

Derek built a price prediction system for flights. His solution used gradient boosting implemented using SciKit-Learn in Python. The model is hosted in a flask app and is being used by the data science team at Flyr.

Yixin analyzed the tickets that Flyr's customer support deals with to help them better focus their efforts. Her work broke down which partners were most likely to create different issues and how expensive they were to deal with. She visualized this data using both Python's Matplotlib and R Shiny.

Gyft

Our Team: Su Wang

Goal: Efficiently query, analyze and communicate data to answer business questions

Gyft, a wholly owned subsidiary of First Data Corporation, is a leading digital gift card platform with a top-rated mobile gift card app for iPhone and Android. With a high-traffic website and massive amounts of data, Gyft needs data analysts to work with cross-functional teams and to efficiently query, analyze, and communicate data in a fast-paced Internet business environment. A typical day for a Data Analyst intern at Gyft involves creating and maintaining KPI dashboards, working with different departments to understand business questions, coming up with the right metrics, querying the right data efficiently from the database, extracting insights using various statistical techniques, and conveying the right message to the teams. SQL is used frequently; R and Python are used when statistical analysis is needed.

Home.ai

Our Team: Vyakhya Sachdeva, Evelyn Peng

Goal 1: Identify frequently visited places for users and predict commutes between those places

Vyakhya and Evelyn worked on two major projects at Home.ai. In the first project, they developed a production-ready solution to identify frequently visited places based on mobile/GPS location data using the DBSCAN and Gaussian Mixture clustering algorithms. Given the list of places learned in the previous step, they developed an algorithm to identify commutes between these places. They used logistic regression to predict users’ next destination given a departure place and time. Their implementation improved the accuracy of the existing system by 30%.
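
A minimal sketch of the place-identification step, assuming an array of raw (latitude, longitude) fixes; the clustering radius and minimum point count are illustrative, not Home.ai's actual settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def frequent_places(latlon_deg, radius_m=150, min_points=20):
    """Cluster raw (lat, lon) fixes; each dense cluster is a candidate 'place'."""
    coords = np.radians(latlon_deg)          # the haversine metric expects radians
    eps = radius_m / EARTH_RADIUS_M          # convert metres to radians
    labels = DBSCAN(eps=eps, min_samples=min_points,
                    metric="haversine", algorithm="ball_tree").fit_predict(coords)
    centers = {}
    for label in set(labels) - {-1}:         # -1 is DBSCAN noise
        centers[label] = latlon_deg[labels == label].mean(axis=0)
    return labels, centers

# labels, centers = frequent_places(np.array([[37.7749, -122.4194], ...]))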

Goal 2: Predict states of different home devices based on sensor data from IoT devices, time, and environment factors

Their second project was related to home device automation, in which they combined time-series data from IoT devices (motion sensors, electric outlets, door locks, etc.), environmental data (temperature, time of day, etc.), and user location, and built machine learning models using neural networks to anticipate users’ needs in an autonomous home. Their model achieved an overall accuracy of roughly 80%.

Houston Astros

Our Team: Eric Lehman

Goal: Develop an automated algorithm to characterize the MLB strike zone for different counts and stances

Eric used both machine learning and analytic approaches to model the actual strike zone given 2015 and 2016 MLB pitch histories. Several different modeling approaches were investigated using Python's scikit-learn package, including LDA, decision trees, and gradient boosting. The strike zone was characterized by finding the best superellipse that minimized the misclassification rate for a given count and batter stance (L or R). Detailed visualizations of the strike zone were created using R's Shiny package.
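
A rough sketch of the superellipse fit (not Eric's actual code): for a given count and stance, grid-search the superellipse half-widths and exponent that minimize the called-strike misclassification rate. The zone center, parameter ranges, and coordinate conventions below are assumptions:

```python
import numpy as np
from itertools import product

def superellipse_strike(px, pz, cx, cz, a, b, n):
    """True where a pitch falls inside the superellipse |x/a|^n + |z/b|^n <= 1."""
    return (np.abs((px - cx) / a) ** n + np.abs((pz - cz) / b) ** n) <= 1.0

def best_superellipse(px, pz, called_strike, cx=0.0, cz=2.5):
    """Grid-search half-widths and exponent to minimize misclassification."""
    best, best_err = None, 1.0
    for a, b, n in product(np.linspace(0.7, 1.2, 11),
                           np.linspace(0.9, 1.4, 11),
                           np.linspace(2.0, 6.0, 9)):
        pred = superellipse_strike(px, pz, cx, cz, a, b, n)
        err = np.mean(pred != called_strike)
        if err < best_err:
            best, best_err = (a, b, n), err
    return best, best_err

# px, pz: horizontal/vertical pitch locations; called_strike: boolean array of calls.
```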

Isazi

Our Team: Christine Chu, Erin Chinn

Goal: Predict clinical characteristics of breast cancer patients using mRNA expression levels

Precision medicine is an emerging field in which medical treatment is decided based on a patient’s genomic content. Given the high dimensionality and complex nature of genes and their interactions, neural networks and deep learning are well suited for the predictive models involved in precision medicine. Christine and Erin built a multi-task neural network model in conjunction with a denoising autoencoder to predict clinical characteristics based on the expression levels of breast cancer patients. They used multiple libraries in Python (NumPy, scikit-learn, Keras, Theano) to develop and test their models. With their model, they were able to achieve 93% and 82% accuracy in predicting two important clinical features that distinguish between different types of breast cancer cells.

LA County

Our Team: Matt McClelland

Goal 1: Modeling Los Angeles County deed transactions for accurate forecast

The goal of this project was to recreate a time series model provided to LA County by UCLA economist William Yu. The resulting ensemble model uses VAR techniques and Dynamic Regression with numerous autoregressive terms.

This project required ggplot2 for validation and ease of forecast projection. The statistical implementation combined various packages with custom functions to achieve ensemble forecast results. Further work needs to be done to bring this model in house as a cleaned-up working product.

Goal 2: Cleaning, aggregating and visualizing LA County voting data

As an open-ended exploration into LA County Vote By Mail data, much work was done to both clean and augment the data. Data augmentation was done using the Google geocoding API and R’s ggmap package for lat/lon and distance queries. Voter data was also supplemented by Census data merged at the census tract level. Using this data, we were able to generate several EDA-style reports for LA County’s in-house reference. The visualization was made using Leaflet and Shiny. Currently this is an in-house application for LA County; however, there is potential to make it public facing.

LendUp

Our Team: Francisco Calderon Rodriguez

Goal: Build a classification model to predict the likelihood of a customer email being a complaint

Francisco used Python to extract email text, formatted in JSON, from the customer management platform. He collected over 5,000 non-complaint emails and 260 complaint emails and applied CountVectorizer from scikit-learn to generate frequencies for each of the 6,000 features. He then modeled the features using logistic regression with an L1 penalty to perform feature selection.

Loot Crate

Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton

Goal: Understand customer sentiment and demand to gain insight into future product diversity and customer desires

At Loot Crate, a subscription-based "geek and gaming" memorabilia service, understanding what types of products consumers want and how they feel about the subscription offerings currently available is paramount to success. Cameron and Mikaela used Natural Language Processing (NLP), specifically text analyzers and sentiment analysis, to quantify customer experience and explore trends around historical consumer insights. These results were combined with Naive Bayes Classification and exploratory analysis of historical product offerings to help guide future Loot Crate product curation.

Mindlight Medical

Our Team: Spencer Smith

Goal: Build a web interface for clinics to upload EEG data

Data was parsed and then saved in MongoDB. Spencer developed a new algorithm for classifying developmental trajectories of medical data. He plans to publish this innovation.

Mozilla

Our Team: Connor Ameres, Andre Guimaraes Duarte

Goal: Create an interactive dashboard of Firefox crash statistics

Andre and Connor used Spark (PySpark + SparkSQL) to create an ETL pipeline that generates a tri-weekly (M, W, F) report of crash analysis on a representative 1% sample of the population from Firefox's release channel on desktop. They used Mozilla's MetricsGraphics.js library (D3) in order to produce an interactive visualization of these data.

In addition, Andre performed several ad-hoc analyses of Firefox user behavior data, such as finding a correlation between heavy users and early adopters using logistic regression. Connor also used various clustering methods and anomaly detection algorithms, like Isolation Forests, to segment users based on their corresponding engagement metrics.

Oakland Athletics

Our Team: Joshua Amunrud

Goal: Predict game-level attendance for 2017 season and analyze Stubhub ticket listings over time

Game-level data includes ticket sales, day of week, opponent, promotion type, and more. A linear regression model was fit to this data both to predict game-by-game attendance and to quantify the effect of each factor. This information could be used to cluster games for the multi-game ticket packs. Another project used R with ggplot2 to plot ticket listings over time in order to gain an intuition for how prices change relative to gameday.

PowWow Energy

Our Team: Will Young

Goal: Predict water stress in almond trees using remotely sensed data (e.g. weather, aerial imagery)

Will used established methods in water stress management to engineer features from remote data. These features were used as input to a Gradient Boosting algorithm to predict water stress at the tree level. Other projects included a Kmeans algorithm to measure tree diameter from aerial imagery. He primarily used Python with Numpy, Sklearn, and Pandas.

San Francisco 49ers

Our Team: Melanie Palmer

Goal: Model purchase propensity for season ticket holders as well as third party events to increase sales leads

Melanie used Python to build a gradient boosting classifier to identify purchasers from nearly half a million records. The model was integrated into a dynamic server using Redshift and Crontab to update likelihood-to-purchase scores and predictions on a weekly basis. Data was analyzed and modeled using R and Tableau.

Scientific Revenue

Our Team: Tim Zhou

Goal: Investigate the presence of various economic effects within gaming metrics

Tim wrote internal reports and dashboards of various mobile gaming metrics using R Shiny. He performed various data transformations and cleaning to prepare data for modeling.

Our Team: Lin Chen

Goal: Build a data analysis dashboard and perform feature exploration to optimize models

Lin used SQL and R to analyze in-app purchase behavior in mobile games, and designed the visualization dashboard using R Shiny. She used Python, R, and QGIS to explore new features from external data, performed feature engineering and feature selection, and optimized the models. Using the new features, the final model achieved a 20% lift on users’ lifetime value prediction.

Scribd

Our Team: Ruixuan Zhang, Brigit Lawrence-Gomez

Goal: Develop ensemble classification model to classify book chapters and improve user reading experience

In order to ensure maximum reader satisfaction, Scribd is interested in presenting its digital reading content in the optimal way, skipping all the boring material at the beginnings and ends of books. To correctly tag its vast digital library, Scribd relies on the power and efficiency of machine learning. We successfully applied various machine learning, feature extraction, and natural language processing tools to chapter-level book data, achieving 96% accuracy using Python, scikit-learn, and GloVe word vectors.

Shippo

Our Team: Sheri Nguyen

Goal: Build a model to estimate the number of days a package will be in transit

As a shipping API platform, Shippo aims to make shipping fast, cost-effective and easy to use for its consumers. This project served to give an alternative arrival estimation in addition to the carrier's estimated delivery date. She tested a variety of models including: Random Forest, Linear Regression, Poisson Regression and Gradient Boosting. To evaluate her results, she used K-Fold Cross Validation and error metrics such as MSE and MAE.

Silicon Valley Bank

Our Team: Jinxin Ma

Goal: Use network analysis to determine the most central customers of the bank

Jinxin used R and Python to perform network analysis on Silicon Valley Bank’s CRM data. The analysis helped determine the most central customers, providing the bank with information on which customers to further invest in.

Simpatica Medicine

Our Team: Juan Pablo Oberhauser

Goal: Build a data acquisition pipeline for FASTA RNA sequencing files

The pipeline used tools such as pseudo-alignment and pyspark to download, reshape, and quantify genomic data. Juan Pablo built a tree-based ensemble classification program in Spark to diagnose several diseases and viral infections. He primarily used Spark, Python and AWS.

Summit Public Schools

Our Team: Arda Aysu

Goal: Build a model to predict student scores on external state-mandated exams using internal metrics

After gaining familiarity with the structure of Summit's student performance data, a model was built in Python to predict their test scores. Time was also spent giving insight on data-driven approaches and helping the Summit team learn Python.

Turo

Our Team: Roger Wu

Goal: Quantify the impact of pricing and other factors on conversion

Turo is a car rental marketplace where car owners can rent out their cars to travelers. Understanding how price and other factors impact conversion is important for the success of the business. Using R and SQL, Roger quantified the impact of these factors on conversion. This was done using an observational study and a logistic regression model.

UCSF

Our Team: Yichao Zhu

Goal: Explore the seasonality and regional outbreak of conjunctivitis based on social media posts

Yichao extracted tweets that mentioned conjunctivitis, along with related fields such as replies, location, and time, using Python, AWS, and the Twitter API. Social-network-based methods were implemented to estimate users' locations when they were not provided. She also suggested NLP (content-based) methods for location estimation. She created a database to store the datasets for further use.

UCSF Oncology

Our Team: Vincent Rideout

Goal: Use Deep Learning to study the radiation therapy pretreatment process

Vincent used convolutional neural networks to predict pretreatment passing rates from images representing the radiation dosage applied to each part of the body (fluence maps). He experimented with many different neural net architectures and hyperparameters to optimize the quality of the predictions. Transfer learning was used to improve performance on a small dataset: a model trained to excel at image recognition on the ImageNet real-world image dataset was repurposed for this problem and turned out to be the best solution. The team is about to submit an academic paper based on their findings and will be presenting at the American Association of Physicists in Medicine Annual Meeting & Exhibition.

Valor Water Analytics

Our Team: Tim Zhou, Zefeng Zhang

Goal: Leveraging utility companies' water usage data to optimize operational effectiveness and revenue sources

Tim developed a hybrid clustering algorithm using KMeans and DBSCAN to help flag potentially anomalous water meters. Tim explored using Fourier analysis to identify periodic behaviour, but eventually settled on a simple autocorrelation algorithm. Zefeng investigated correlations between gas and water usage data. Zefeng developed an anomaly detection model using lag 1 Markov Chains.
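
As a sketch of the autocorrelation idea (with an assumed lag range and threshold, not Valor's actual values), a meter's usage series can be flagged as periodic when its autocorrelation at some candidate lag is high:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a 1-D series at a given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.dot(x[:-lag], x[lag:]) / denom if denom > 0 else 0.0

def looks_periodic(usage, lags=range(1, 31), threshold=0.5):
    """Flag a usage series if any candidate lag is strongly autocorrelated."""
    scores = {lag: autocorr(usage, lag) for lag in lags}
    best_lag = max(scores, key=scores.get)
    return scores[best_lag] >= threshold, best_lag, scores[best_lag]

# Example: a noisy weekly cycle in daily readings should be flagged at lag 7.
daily = np.tile([5, 5, 5, 5, 5, 20, 22], 20) + np.random.default_rng(0).normal(0, 1, 140)
print(looks_periodic(daily))
```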

Vida Health

Our Team: Lawrence Barrett

Goal 1: Identify outliers in self-recorded weight data to clean datasets

Lawrence used R and a distance-based outlier detection algorithm to identify outliers in weight data. The algorithm came from an academic paper, but it was modified to work on the weight data that Vida provided.

Goal 2: Evaluate performance of a new coach matching algorithm via A/B testing

Lawrence used A/B testing in R to confirm that the new coach matching algorithm significantly improved communication between coaches and users. He also evaluated the power of the test to determine the reliability of these results.

Goal 3: Pulling data from Vida's database for company reports

Lawrence used SQL in BigQuery, Google's cloud data warehouse, to pull data for reports that helped determine the effectiveness of Vida's weight loss, diabetes prevention, and blood pressure monitoring programs. This was an ongoing need throughout most of the internship, since reports needed to be updated and turned in on a monthly basis. Lawrence used a combination of SQL and R to query and clean the data for these reports.

Goal 4: Testing the Vida ChatBot

Lawrence implemented many tests on functions in production that are used to understand user text input by grabbing the important words needed to respond to the user's queries. He used Python's unit testing framework to test these functions.

Our Team: Donny Chen

Goal 1: Develop recommendation systems to personalize users’ healthcare reading

Donny performed NLP topic modeling and applied information retrieval on large documents and users’ messages to engineer features. He developed recommendation systems including collaborative filtering in Python to personalize healthcare readings upon a keywords search.

Goal 2: Data pull from Google BigQuery

Donny used SQL to retrieve and aggregate data from Google BigQuery, ranging from KPIs to user-level metrics. He also used R to cleanse and visualize the data for various analytical reports.

Goal 3: Design and migrate from Google BigQuery RDBMS to Neo4j graph database

Donny contributed to designing the schema and importing data for healthcare readings into a Neo4j graph database to help the transition from a traditional relational database to a more scalable graph database. Retrieving connected information in a relational database is often cumbersome because it requires many table joins, which can be done much more efficiently in a graph database.

Goal 4: Engineer and deploy web server event tracking of user changes in Django

Donny used Django to automate events logging to keep track of changes in users’ information. He worked in Python and contributed to the web server in production.

Vungle

Our Team: Shivakanth Thudi, Danny Suh, Matthew Wang, Jennifer Zhu

Goal: Improve the revenue share model

Vungle’s goal is to evaluate the feasibility and profitability impact from engaging in a revenue share business model with advertisers. Matthew and Jennifer constructed a data extraction and processing pipeline using Python and Spark and built an interactive dashboard as well as a Machine Learning model to enable the sales team to explore, visualize and predict profitable segments of the market. Danny and Shiva improved Vungle’s lifetime value (LTV) models through the use of libraries such as XGBoost, SKlearn, and Spark ML.

Wikia

Our Team: Albert Ma

Goal: Implement recommendation system using collaborative filtering on users and wikis

Wikia displays recommended wikis at the bottom of its pages, selected from a custom list. Albert wanted to generate recommendations using collaborative filtering instead of manually picking which wikis to show to users. He implemented a recommendation system using alternating least squares matrix factorization in Python with NumPy and pandas. He generated baseline models (random guessing and recommending popular wikis) to evaluate model performance, and reduced the miss rate by 25% and 20% compared to the random and popularity baselines, respectively.
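
A compact NumPy sketch of alternating least squares on a toy user-wiki interaction matrix (the matrix, rank, and regularization below are illustrative, not Wikia's data):

```python
import numpy as np

def als(R, k=2, reg=0.1, iters=20, seed=0):
    """Factor R (users x wikis) into U @ V.T by alternating ridge solves."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        U = R @ V @ np.linalg.inv(V.T @ V + I)    # fix V, solve for U
        V = R.T @ U @ np.linalg.inv(U.T @ U + I)  # fix U, solve for V
    return U, V

# Toy interaction matrix: rows are users, columns are wikis (1 = visited).
R = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
U, V = als(R)
scores = U @ V.T   # predicted affinity; recommend unseen wikis with top scores
print(np.round(scores, 2))
```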

Williams-Sonoma

Our Team: Maxine Qian, Zainab Danish

Goal 1: Build machine learning models to predict prospects of Open Kitchen products

Goal 2: Incorporate a new variable ‘Clumpiness' and gauge whether it is a valuable addition to the RFM framework through modeling

Maxine and Zainab worked with the analytics team to generate consumer insights for a new brand and built models to predict the likelihood of a user purchasing the new brand. They extracted features using SQL, conducted exploratory data analysis in R, and performed feature engineering and built classification models in Python. The model was used to decide which customers to target for the new brand. They also used machine learning methods to gauge whether the ‘Clumpiness’ variable is a valuable addition to the RFM framework.

Xambala

Our Team: Valentin Vrzheshch

Goal: Build a tool to benchmark the company’s trading performance

Valentin wrote Python scripts that compute indicators of high-frequency trading costs for each order (metrics such as implementation shortfall, toxicity, and market impact) using pandas and psql. Valentin also developed dashboards with visualizations of the indicators using R (shiny, plotly, googleVis) for easy and fast assessment of the algorithms’ performance.

Xoom

Our Team: April Liu

Goal 1: Build a fraud detection model

April evaluated lasso logistic regression, random forest, AdaBoost/gradient boosting and built a final model that improved baseline F0.5 score by 35%. She evaluated various performance metrics based on the company’s business model and performed correlation analyses and data transformations.

Goal 2: Pipeline for feature importance analysis

April designed and built different Cassandra databases to house 100 GB of data. She imputed missing signals in data used to train risk profile models and implemented a hashing method to impute signals on a rolling basis. She also assessed feature importances via a random forest classifier and extracted over 100 signals from over 200 GB of data to identify fraud using Python.

Goal 3: Use one set of signals to replace the existing expedite rule, filtering out high-risk transactions more effectively while maintaining a high proportion of expedited transactions

April performed advanced analytics and exploratory data analysis on two different payment source signals to come up with a solution to improve the current payment expedite rule.

Our Team: Alice Zhao

Goal: Build fraud detection models

Alice worked on building fraud detection models for Non-Sufficient Funds fraud, which is a difficult problem due to the highly imbalanced dataset. Alice tried different sampling methods, cost-sensitive algorithms, and hashing tricks to build a good model using Python, R, and SQL.

Our Team: Jinxin Ma, Alice Zhao

Goal: Compare feature importance measures and build a better fraud detection model

Jinxin and Alice worked with the Risk Data team at Xoom. They compared feature importance measures from random forest, gradient boosting, and extra trees, and predicted whether a transaction is fraudulent using logistic regression, working in Python, R, and Tableau.

Our Team: Jinxin Ma

Goal: Impute missing values for transactions and re-evaluate fraud detection rules

Jinxin created his own database using PostgreSQL and wrote efficient queries to impute missing values for millions of transactions. The imputation improved the performance of fraud detection rules.

2016

Airbnb

Our Team: Ben Miroglio and Chhavi Choudhry

Goal: cluster web sessions to segment users and improve the flow of Airbnb's website and mobile app

Ben and Chhavi employed machine learning techniques to identify features indicative of positive outcomes using R and Python. They built an interactive web session visualizer using D3.js to identify key differences among different segments of users and to identify bottlenecks in the session journey.

BeeBell

Our Team: Paul Thompson and Jacob Pollard

Goal: cluster users based on factors influencing event preferences and classify event category based on the event description

Paul and Jake applied the ROCK hierarchical clustering algorithm to event data in Python and clustered users based on the events they attended. They also implemented a Naive Bayes classifier using only base Python data structures, employing 5-fold cross validation, resulting in 75% mean accuracy.
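
As an illustration of what a base-Python Naive Bayes text classifier can look like (a sketch, not Paul and Jake's actual implementation), using only dictionaries and counters with add-one smoothing:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes training with base Python data structures."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # class -> word -> count
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in doc.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# model = train_nb(event_descriptions, event_categories)
# predict_nb(model, "live outdoor music festival with food trucks")
```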

Capital One Labs

Our Team: Vincent Pham and Brynne Lycette

Goal: employ machine learning techniques for credit card fraud detection and build a data unification platform

Capital One's fraud team has collected and built more than two hundred features relevant to classifying fraudulent credit card transactions. Vincent and Bree employed various machine learning techniques using H2O and Dato in order to evaluate software robustness and increase accuracy of fraud prediction. They also implemented a NoSQL data store and a higher level in-memory storage system to unify various streaming and batch processes.

ChannelMeter

Our Team: Ghizlaine Bennani and Mrunmayee Bhagwat

Goal: cluster similar YouTube channels

Ghizlaine and Mrun developed an algorithm using supervised and unsupervised machine learning techniques using Python and PostgreSQL to cluster similar YouTube channels and videos based on performance and content metrics. This clustering resulted in personalized targeting for multi-channel YouTube content providers. They also modeled median views for individual YouTube videos in their first week using regression analysis.

City of Hope

Our Team: Isabelle Litton

Goal: extract sites of cancer recurrence in clinical notes

Isabelle leveraged natural language processing using Linguamatics to capture cancer recurrence sites from clinical and radiology notes. She also automated the results-validation process with Python, saving approximately two hours per validation.  

Clorox

Our Team: Tate Campbell and Sharon Wang

Goal: model consumer churn for a product loyalty program that identifies key features contributing to customer retention rates

Tate and Sharon used PySpark to extract relevant data and perform feature engineering on more than 10 GB of data. A random forest classification model was built using Python to predict the amount of time consumers would stay actively enrolled in the program.

Dictionary.com

Our Team: Miao Lu

Goal: help Dictionary.com understand super-user behavior

Miao carried out two separate projects at Dictionary.com. She created a dashboard employing sunburst diagrams for session-based visit sequence monitoring, using MapReduce (Python), Hadoop streaming and D3 (javascript, html, css). She also analyzed super-user retention using Hive.

Eventbrite

Our Team: Meg Ellis and Jack Norman

Goal: create a price-suggestion model to assist event organizers in optimizing ticket sales and revenue

After identifying the features that most influence ticket prices, Meg and Jack implemented a K Nearest Neighbors model that groups events with similar characteristics, and then leveraged the distribution of prices of these similar, successful events to suggest an appropriate range of ticket prices that the organizer can use when creating their event. Flask was then used to create a web application that allows users to interact with the model.
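
A minimal sketch of the nearest-neighbor pricing idea, assuming numeric feature matrices for past and new events (the feature choice, neighbor count, and percentile range are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def suggest_price_range(event_features, catalog_features, catalog_prices, k=25):
    """Suggest a ticket-price range from the k most similar past events."""
    scaler = StandardScaler().fit(catalog_features)
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(catalog_features))
    _, idx = nn.kneighbors(scaler.transform(event_features.reshape(1, -1)))
    neighbor_prices = catalog_prices[idx[0]]
    # Suggest the interquartile range of what similar events charged.
    return np.percentile(neighbor_prices, 25), np.percentile(neighbor_prices, 75)

# catalog_features: NumPy array of numeric features for past events
# (capacity, duration, category code, ...); catalog_prices: their ticket prices;
# event_features: the new event's feature vector.
```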

First Republic Bank

Our Team: Piyush Bhargava, Felipe Chamma and Harry O'Reilly

Goal: optimize liquidity allocation

Felipe, Harry, and Piyush used SQL Server to develop a new process to optimize cash allocation for the liquidity buffer by leveraging client transaction data and end-of-day positions. The process was automated and now runs on a daily basis to improve performance. A SQL-based tool was also designed to detect unusual customer transaction patterns and alert bank representatives to mitigate liquidity risks. The loan-level loss allocation process was also automated to support the capital stress testing required by new financial regulations.

FLYR

Our Team: Matthew Leach

Goal: investigate which customers make claims on FLYR's FareKeep product, which allows customers to lock in a flight price for up to 7 days

Matthew investigated how to better identify which customers were more likely to make a FareKeep claim using clustering techniques to group customer segments, and employing logistic regressions to predict claim rate. He additionally used Bokeh to create a dashboard displaying factors which influence the claim rate.

Juvo

Our team: Ghizlaine Bennani

Goal: build a dashboard to visualize JUVO data in an interactive fashion

Ghizlaine built a dashboard using a Flask app in Python to render an interactive D3 visualization that synchronizes real-time data from Redshift. The purpose of the dashboard is to help customers understand JUVO's business and to visualize data that supports business units in their decision-making process.

Juvo

Our Team: David Wen, Jaime Pastor

Goal: create a dashboard to analyze user retention and build customer credit profiles

David built a dashboard using JavaScript and Flask, and architected Juvo's internal data flow to supply data to the dashboard using Airflow. He also built a classifier to predict which users would be retained. Using PostgreSQL and Python (pandas, scikit-learn), Jaime explored Juvo's dataset, engineered features and tested different machine learning algorithms to predict which users will pay back a loan based on their mobile behavior.

Lawfty

Our Team: Tate Campbell, Sharon Wang

Goal: build a machine learning model to predict inquiry frequency to optimize bid prices for Google AdWords

Tate and Sharon used k-means clustering to cluster hours of the day with similar numbers of impressions and clicks, then employed a random forest to quantify the relationship between cost and inquiries. They also built an optimization algorithm to search for the best combination of daily bid prices based on the predicted number of inquiries and budget constraints.

Metromile

Our Team: Gabrielle Corbett and Jason Helgren

Goal: validate the utility of an app-based street sweeping alert

Metromile provides an app with convenience features including fuel economy analysis, engine monitoring, and street sweeping alerts. Gabby and Jason developed a logistic regression model using Python and PostgreSQL to predict driver behavior based on past driving habits and whether the driver received a street sweeping alert.

NETBASE

Our Team: Mrunmayee H. Bhagwat

Goal: build a time series forecasting model

Using NLP techniques Mrunmayee categorized Walmart’s unstructured social media data and modeled their social buzz using a generalized linear model. She also performed time series analysis on the data to identify seasonal fluctuations and trend in Walmart’s quarterly revenue and built a revenue forecast model using a SARIMA approach in R.

Radium One

Our Team: Kirk Hunter

Goal: improve mobile targeting and optimize bid pricing in the programmatic advertising space

Programmatic advertising takes place on desktop and mobile devices and the space can yield billions of data points per day. Kirk interacted with extremely large datasets using the distributed computing framework Apache Hive. He built machine learning models using Python’s scikit-learn library that identified mobile users who are most likely to lead to a successful outcome for the company.

Stanford University Graduate School of Business

Our Team: Alex Morris

Goal: develop a structured method by which to extract data from SEC EDGAR filings

Alex assisted in the development of the SecParser Python package and implemented a data pipeline that extracts key data for analysis from unstructured SEC EDGAR form filings (Form 4). He subsequently carried out various machine learning and regression techniques on the parsed data to identify which filings have a significant impact on the issuer's stock performance following large insider transactions.

Summit Public Schools

Our Team: Jaclyn Nguyen

Goal: identify achievement gaps and calibrate grading system

Jaclyn used national MAP assessments and statistical analysis to confirm that students met their projected assessment growth independent of race and socioeconomic group. She additionally conducted an analysis on whether teachers graded consistently across grade levels and subjects, and provided a regression-based recalibration technique.

UCSF

Our Team: Alex Romriell and Swetha Venkata Reddy

Goal: create a model to detect conjunctivitis outbreaks

Alex and Swetha performed text and geospatial analysis on over 300,000 tweets to detect local pinkeye outbreaks. They created a framework for identifying tweets directly related to conjunctivitis. Surges in outbreaks were mapped to clinical records nationwide. Time series analysis of the tweets revealed trends and seasonality similar to the actual hospital data. Text analysis techniques such as Latent Semantic Analysis on AWS were employed to filter noise from the data. A multinomial Naive Bayes model was also developed based on TF-IDF scores of the tweets to predict sentiment.

University of San Francisco, Advancement Office

Our Team: Jacob Pollard

Goal: identify potential donors

Using random forest models in R, Jacob selected the 10 of 70 variables in the USF alumni donor database that had the strongest influence on predicting a potential donor. These variables were used to build an ensemble of logistic regression classifiers with bagging. This method was subsequently implemented in Python with the help of scikit-learn, NumPy, and pandas.

Upwork

Our Team: Paul Thompson

Goal: improve predictions of freelancer job performance ratings

Paul created classification models using LSTM and GRU neural networks, Word2Vec and Doc2Vec, and TF-IDF in conjunction with machine learning algorithms such as random forest and support vector machines. He used Python's scikit-learn, gensim, and Keras and ran models both locally and on AWS EC2 clusters (using CloudML) and EC2 GPU instances (using ssh), succeeding in improving upon the existing production model.

Voodoo Sports

Our Team: Ryan Speed

Goal: develop a framework to support processing and analysis of NBA player tracking data

Ryan built a MapReduce framework in Python to support the creation of descriptive and predictive models and to generate analytics output for SportVu NBA player tracking data, which generates 800,000 data points per game. Clustering and similarity scoring were conducted on player location distributions, court spacing, and distance traveled across multiple 55GB datasets.

Vivid Vision

Our Team: Alex Romriell and Swetha Venkata Reddy

Goal: determine the efficacy of specialized Oculus Rift eye treatment software

Alex and Swetha optimized the ETL (extract, transform, load) processes using Python and PostgreSQL to better ingest user game-play data. SQL was used to query the dataset and retrieve key eye metrics from the game logs. They confirmed that the treatment correctly targeted the weak eye. They also created a D3.js dashboard to visualize patterns and behaviors in users’ gaming sessions.

Vungle

Our Team: Chhavi Choudhury, Yikai Wang and Wanyan Xie

Goal: develop a User Conversion Rate Model and a User Lifetime Value Model as a mobile app advertising company

Vungle built an ad recommendation system based on a user-conversion rate prediction model with ad view-install data. Wanyan implemented a factorization machine model to quickly calculate the weights of interaction terms in the logistic regression model, and Yikai ran a test-performance-based feature selection algorithm to select features in the current model and also implemented a gradient boosting tree model. Using Python and Spark, they were able to improve both the efficiency and accuracy of predictions. To build user lifetime value (LTV) models, Chhavi, Yikai, and Wanyan identified germane user-related features and developed various models to predict active user days and 7-day revenue across different advertisers.

Wikia

Our Team: Isabelle Litton

Goal: automate content tagging process to improve ad targeting

Isabelle trained multiple classifiers using Python’s scikit-learn package and TF-IDF features to achieve an overall accuracy of 86%.

Williams-Sonoma, Inc.

Our Team: Jaclyn Nguyen, Sakshi Bhargava, and Henry Tom

Goal: develop an automatic image selection and image tagging algorithm

Williams-Sonoma’s product feed contains a tremendous amount of image data, and it is difficult to automatically tag images for both analysis and product recommendations. Jaclyn, Sakshi and Henry successfully automated this process through the use of custom built ensemble image processing algorithms, superpixels, and other recent imaging advances, achieving an accuracy of 99% between silhouette/product images. They also developed a color prediction and color labeling algorithm with 90% accuracy.

Wiser

Our Team: Erica Lee and Binjie Lai

Goal: improve and test dynamic pricing strategies

Wiser provides pricing strategies for e-commerce retailers to optimize revenue. Binjie applied tuned ridge regression and time series models for prediction, designing experiments and testing results, and improved upon the current prediction model by 15%. Erica employed a linear mixed effects model to measure the effectiveness of the dynamic pricing engine, using technologies that included Python, Spark, and PostgreSQL, and working on the big data platforms Amazon Redshift and Databricks.

Womply

Our Team: Felipe Formenti Ferreira

Goal: enhance Womply's Statboard metrics, evaluate customer churn, and identify high-value clients

Using clients' daily revenue data, Felipe used Python to parse transaction histories and provide business insights such as growth rate and average transaction size. The Statboard is now live. Felipe also used principal component analysis to eliminate correlation between predictor variables and identify influential factors contributing to customer retention.

2015

AutoGrid

Our Team: Brian Kui and Tunc Yilmaz

Goal: implement generalized linear models and neural network models to improve existing load forecasting models

AutoGrid helps industrial customers shed power by controlling the operation of power-consuming devices such as water heaters. The team evaluated modifications to the forecasting models proposed by the data science team in order to help AutoGrid decide whether it is feasible to incorporate the modifications in the production code. They analyzed signals received, load, and the state of the water heaters, and identified errors in operation.

ChannelMeter

Our Team: Cody Wild

Goal: provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for product's subscribers

Cody utilized clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that utilizes patterns of word reoccurrence to identify nearest neighbors in content space.

Clorox

Our Team: Kailey Hoo, Griffin Okamoto, and Ken Simonds

Goal: mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R

The team quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.

Convergence Investment Management

Our Team: Matt Shadish

Goal: apply machine learning techniques to improve an existing trading strategy

Matt used Python and pandas to incorporate external variables and apply cross-sectional models to a time series problem. He also created visualizations of current trading strategy performance using ggplot2 in R.

Danaher Labs

Our Team: Brian Kui and Tunc Yilmaz

Goal: query time-series printer data that is highly unbalanced: less than 200 faults within two million time records

Brian and Tunc applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.

Dictionary.com

Our Team: Alice Benziger

Goal: create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data, such as page views (on mobile and desktop applications), email click-through rates, and social media (Facebook, Instagram, and Twitter) interactions

Alice applied machine learning techniques to implement a model to predict the popularity score of new words to optimize user engagement.

Engage3

Our Team: Matt Shadish

Goal: perform analysis of historical retail product prices across stores

Using Python, Matt created visualizations of these analyses in Matplotlib. He then re-implemented the analysis as a functional solution (using RDDs and DataFrames) to take advantage of Apache Spark. This enabled an analysis of billions of price history records in a reasonable amount of time.

Fandor

Our Team: Steven Chu

Goal: define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics

As Fandor is a subscription-based model, their focus is to bring in more subscribers and retain current subscribers. There is a lot of potential to use metrics to segment as well as run predictions for users. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to see and utilize in their own decision-making processes.

Flyr

Our Team: Florian Burgos and Dan Loman

Goal: use machine learning to predict the price of connecting flights based on the price of the one-way tickets

Florian and Dan improved user engagement on the website by displaying content on the landing page with D3. They also computed content overnight using distributed computing on an AWS EC2 instance to find the best deals in the U.S. by origin.

GE Software

Our Team: Chandrashekar Konda

Goal: solve parts normalization and payment terms mapping tasks

Using Hadoop and Elastic search, Chandrashekar identified similar mechanical parts out of five million parts in oil rig design versions for GE Oil & Gas business.

In a separate project, Chandrashekar used Python and Talend to identify the best payment terms from one million payment terms across GE’s different businesses.
Sourcing: Using Python, he compared over 1.8 million purchase transactions with 50,000 of GE’s products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries.

Google

Our Team: Sandeep Vanga

Goal: perform unsupervised text clustering to gain insights into representative sub-topics

Sandeep built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep-learning-based feature) models: the first based on aggregation of word vectors and the second based on a Bag of Clusters (BoClu) of words. He also implemented the elbow method to choose the optimal number of clusters. These algorithms were validated on 10 different brands/topics using news data collected over one year. Various quantitative metrics, such as entropy and silhouette score, along with visualization techniques, were used to validate the algorithms.
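
As a small illustration of the baseline approach (with invented snippets standing in for the news corpus), documents can be clustered on TF-IDF features with k-means, with the inertia curve guiding the elbow-method choice of k:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented news snippets; in practice these would be a year of articles for one brand.
docs = [
    "company launches new phone with improved camera",
    "quarterly earnings beat analyst expectations",
    "new phone camera praised in early reviews",
    "stock rises after strong earnings report",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Elbow method: fit k-means over a range of k and inspect the inertia curve.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 4)}
print(inertias)

best_k = 2  # chosen where the inertia curve bends (assumed here)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(labels)
```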

Lawfty

Our Team: Brendan Herger

Goal: study existing data stream to drive business decisions, and optimized data extract-transform-load process to enable future insightful real-time data analysis

Though Lawfty’s existing pipeline had substantial outage periods and largely unvalidated data, Brendan was able to support the creation of a new Spanish-language vertical, build near-real-time facilities, and contribute to better targeting of AdWords campaigns.

LiveCareer.com

Our Team: Fletcher Stump Smith

Goal: perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (SciPy)

Fletcher wrote code to store and process text data, using Python and SQLite. He performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.

Los Angeles County

Our Team: Michaela Hull

Goal: find duplicate voters using exact and fuzzy matching, feature engineering such as distances between two points of interest, trawling the Census Bureau website for potentially useful demographic features, and classification models, all in the name of poll worker prediction

Michaela employed distributed computing, the Google Maps API, relational databases, large databases (~5 million observations), and a variety of machine learning techniques.

myFitnessPal.com

Our Team: Layla Martin and Patrick Howell

Goal: develop a machine learning model to predict a flavor label for every food in MyFitnessPal’s database

Using primarily Python and SQL, the team built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within UnderArmour.

Ouiota

Our Team: Leighton Dong

Goal: build consumer credit default risk models to support clients in managing investment portfolios

Leighton prototyped a methodology to measure default risk using survival analysis and a Cox proportional hazards model. He developed an automated process to comprehensively collect company information using the CrunchBase API and store it in a NoSQL database. Leighton also engineered datasets to discover potential clients for analytics products (such as retail pricing optimization) and automatically collected company names and other text features from Bing search result pages.

Revup

Our Team: Brendan Herger

Goal: build out multiple data pipelines and utilize machine learning to help drive REVUP's beta product

Brendan was able to create three new data streams that were directly put into production. Furthermore, he utilized natural language processing and machine learning to validate and parse Mechanical Turk output. Finally, he utilized spectral clustering to identify individuals’ political affiliations from Federal Election Commission data.

Stella & Dot

Our Team: Rashmi Laddha

Goal: build a predictive model for revenue forecasting based on stylist’s cohort behavior

Rashmi clustered stylists’ micro-segments by analyzing their behavior in initial days of joining the company and used k-means clustering on three parameters to cluster stylists. She then built a forecast model for each micro-segment in R using HoltWinters filtering and ARIMA, tuning the model to get an error rate within 5%. She also performed sensitivity analyses around changing early performance drivers in stylist’s life cycle.

Summit Public Schools

Our Team: Griffin Okamoto and Scott Kellert

Goal: demonstrate the efficacy of online content assessments.

Griffin and Scott demonstrated the efficacy of Summit's online content assessments by using student scores on the assessments and demographic information to predict standardized test scores. They developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.

Uber

Our Team: David Reilly

Goal: examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R

David constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. He analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.

USF Center for Institutional Planning and Effectiveness

Our Team: Layla Martin and Leighton Dong

Goal: analyze influential factors in USF undergraduate student retention using logistic regression models

The team predicted students' decisions to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.

Williams-Sonoma

Our Team: Sandeep Vanga and Rachan Bassi

Goal: automate the process of image tagging by employing image processing as well as machine learning tools

Williams-Sonoma’s product feed contains more than a million images, and the corresponding metadata — such as color, pattern, and type of image (catalog/multiproduct/single-product) — is extremely important for optimizing search and product recommendations. They automated the process of image tagging by employing image processing as well as machine learning tools. They used image saliency and color-histogram-based computer vision techniques to segment and identify important regions/features of an image. A decision-tree-based machine learning algorithm was used to classify the images. They were able to achieve 90% accuracy for silhouette/single-product images and 70% accuracy for complex multiproduct/catalog images.

Xambala

Our Team: Luba Gloukhova

Goal: quantify the performance of an underlying high frequency trading strategy

Luba expanded the existing internal database with data sources from the Bloomberg Terminal, enabling a deeper understanding of the symbol characteristics underlying strategy performance. She also identified discrepancies in an end-of-day trading analysis database.

Zephyr Health

Our Team: Daniel Kuo

Goal: develop a supervised machine learning algorithm for a Publication Authorship Linkage project to determine whether multiple publications refer to the same author

Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily augment new attributes and features into the models. The modeling techniques used in this project include logistic regression, decision trees, and AdaBoost. The team used the first two algorithms to perform feature selection and then used AdaBoost to improve performance.

Our Team: Monica Meyer and Jeff Baker

Goal: Develop a classification algorithm/model for the Disease Area Relevancy project that would predict and score how related a given document was to a specified disease area.

The model provides Zephyr the ability to quickly score and collect documents, as they relate to a disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The approach is scalable to include other document types.

Our Team: WeiWei Zhang

Goal: Determine disease area relevancy for medical journals using machine learning techniques.

The project began with data sampling from the PubMed database. Through natural language processing and feature engineering, the abstracts and titles of medical documents were transformed into tokens with TF-IDF (term frequency-inverse document frequency) scores. By leveraging the characteristics of a random forest classifier, the most important features were selected from the feature space. The body of the model was a multi-label logistic regression. The results were evaluated based on accuracy, recall, precision, and F1 score. In short, the project is a great example of handling unlabeled data, imbalanced classes, and multi-label problems in a machine learning context.
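
A condensed scikit-learn sketch of this kind of pipeline (TF-IDF features, random-forest-driven feature selection, and a one-vs-rest logistic regression for multi-label disease areas), with invented abstracts and labels standing in for the PubMed data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Invented abstracts and multi-label disease-area targets (columns: oncology, cardiology).
abstracts = [
    "tumor growth inhibited by novel chemotherapy agent",
    "stent placement outcomes in patients with coronary artery disease",
    "chemotherapy cardiotoxicity and heart failure risk in cancer patients",
]
y = np.array([[1, 0], [0, 1], [1, 1]])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
pipeline.fit(abstracts, y)
print(pipeline.predict(["coronary artery disease patients in a treatment trial"]))
```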

Contact Info

MS in Data Science

Kirsten Keihl, Program Manager

USF Downtown Campus, 101 Howard Street, San Francisco, CA 94105 (415) 422-2966