MS in Data Science
Kirsten Keihl, Assistant Director
The practicum is a required, central part of our curriculum which provides students with the opportunity to gain real world experience working with our industry partners. Each project is sponsored by a company, allowing students to work with partner companies to gain analytics experience and reconcile mathematical theory with business practice. Student groups — supervised by an MSDS faculty member — work with the practicum company to identify, define, scope, and analyze a particular business problem. All groups are additionally supported and supervised by the MSDS practicum director.
Following an initial hypothesis, students typically engage in data acquisition, exploratory data analysis, feature extraction, model development and evaluation, as well as oral and written communication of results. Class schedules are set so that students can work onsite one to two days per week.
|Select Practicum Partners|
First Republic Bank
The Coca-Cola Company
Our Team: Arda Aysu, Joshua Amunrud
Goal: Predict complex human activity using mobile device accelerometer and gyroscopic data and detect possible fraud by analyzing impression level data
Joshua and Arda researched several digital signal processing techniques and strategies for clustering for time series data. In the end, we used Python to produce a random forest model that used processed data to predict the activities. They looked at one month of impression level data and tried to identify publishers with unusual characteristics. This is an ongoing project, but tools for further exploration were built in Python.
Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton
Goal: Forecast ridership by gathering and analyzing external factors relevant to ridership demand
At BART, many factors come into play regarding how, when, and where people decide to take public transportation. With recent changes in the transportation industry and growing competition, it is more critical than ever to accurately forecast ridership to plan finances into the future. Cameron and Mikaela used R to develop a SARIMAx time series forecast model incorporating external factors and government data to determine ridership covariates. This modeling algorithm was implemented in a Shiny application to allow for a wider audience at BART to take advantage of these forecasts.
Our Team: Nick Levitt, Kyle Kovacevich
Goal: To build predictive models and ETL pipelines for a cyber security project
Nick and Kyle used a combination of advanced machine learning techniques such as Deep Learning, NLP and network analysis to find outliers and patterns in the data. Work was implemented in Python, PostgreSQL and Spark on top of Hadoop and Parquet.
Our Team: Rui Li, Elise Song
At Clorox we were presented with a business challenge of exploring important factors that correlated with short term sales fluctuations after the breakout of an event. We scraped more than 28 million news titles and 102k full articles relevant to the event, conducted sentiment analyses on 14 million Tweets and Instagram posts for feature engineering, and found significant results from regression analysis. Our presentation was well-received by the product team and the data science team.
Our Team: Keyang Zhang
Goal: Extracting key sentences and detecting signals from news
Keyang built a name entity recognition algorithm in Python to extract the main company name from news. He also used Latent Dirichlet Allocation, Word2vec and FuzzyWuzzy to perform key words and key sentences extraction. Based on the key words and key sentences, Keyang used Gaussian Naive Bayes to build classifiers to detect the signals from each piece of news.
Our Team: Dominic Vantman, Justin Midiri
Goal: Create a Python program that aggregates sales, consumer, and syndicated data into a unified information database and perform advanced analytics on promotional performance and return on investments
Dominic and Justin designed data visualization dashboards to further understand financial performance metrics, customer profitability reconciliation, and promotional trade-spend optimization tactics that yield highest growth across their robust portfolio of customers, products, and geographical markets.
Our Team: Linda Liu
Goal: Perform exploratory data analysis on different aspects of the financial market and use regression and classification methods to forecast/identify alpha
Linda leveraged ensemble method using stacking and majority vote to detect rare events in the financial market. She also wrote machine learning models that could be deployed into production.
Our Team: Claire Broad
Goal: Identify and rank new or missing words for potential inclusion in the dictionary
Claire developed an autonomous system in Python using sci-kit learn to generate a validity score for the items on the monthly list of unmatched queries on the Dictionary site. Her algorithm incorporated signals from lexical structure, query patterns, and usage on social media, and used a novel ensembling technique to mitigate the effect of noise in the training set.
Our Team: Sheri Nguyen, Keyang Zhang
Goal 1: Build an anomaly detection model that found breakages in affiliate reporting
Sheri and Keyang built an anomaly detection system using a combination of the Bayesian Change Point algorithm and the Twitter Anomaly Detection package by integrating R into their Python programs using RPY2. Their model successfully caught some major partner reporting breakages on various platforms.
Goal 2: Build a daily revenue forecasting model
Sheri and Keyang built a model to predict daily revenue. This was an important implementation for Ebates' marketing team, since their daily data funnel from affiliates were often times delayed, an accurate revenue prediction model was necessary in order to make important weekly business decisions of where to allocate marketing campaigns. They implemented a few different algorithms, including Linear Regression and Random Forest. Their model was put into production with a 5-6% error rate.
Goal 3: Build a model to find which customers should be sent reminder emails to Refer-A-Friend to Ebates
Sheri and Keyang built a model to pick out the most likely customers to refer Ebates to a friend. The Refer-A-Friend program is one of the highest revenue generating programs offered by Ebates. They first filtered out Ebates customers for fraudulent behavior and then assigned probabilities to each customer. The customers with the highest probabilities were then sent an email to remind them about the Refer-A-Friend program offered by Ebates. Sheri and Keyang's model improved Ebates' original classification model performance from 13% true positive detection to 80% true positive detection. Their model is currently being used in production.
Our Team: Kelsey MacMillan
Goal: Improve site search and recommendation by extracting high-level features from event data In addition to organizers that Eventbrite has direct relationships with, there are many individual organizers around the world who sign on to Eventbrite by themselves and create events. Sorting through this large inventory of individually organized events to match attendees to their next experience is a challenge. To help address this challenge, Kelsey implemented an unsupervised topic modeling method called Non-Negative Matrix Factorization that extracts key topic "tags" from events using the raw text from their titles and descriptions. For the always-tricky task of validating an unsupervised method, Kelsey built a different type of topic model using a probabilistic approach called Latent Dirchlet Allocation and checked for stability of the found topics across both types of models.
Our Team: Hannah Lieber
Goal: Help the sales and marketing teams better understand characteristics of their revenue generating organizers. Hannah used Hive and Python to develop a random forest model to classify what organizers are likely to churn, allowing Eventbrite to preemptively take action in retaining these organizers.
Our Team: Yige Liu, Anshika Srivastava
Goal: Deposit diversification study and network analysis
Anshika and Yige studied the volatile segments of the deposits to understand if increase in segment balance leads to moderation in volatility. They also used graph theory to design the pseudo algorithm to detect households and identify influential people in the bank’s network using the bank’s historical data. They acquired data from multiple servers and databases using SQL (SQL server) and performed the analysis using Python Pandas and Matplotlib.
Our Team: Graham McAlister, Derek Welborn, Yixin Zhang
Graham built a predictive anomaly detection algorithm by implementing an Isolation Forest in Python. The system takes predictions for search demand of flight characteristics and returns the probability that this point comes from the "normal" distribution in the data set.
Derek built a price prediction system for flights. His solution used gradient boosting implemented using SciKit-Learn in Python. The model is hosted in a flask app and is being used by the data science team at Flyr.
Yixin analyzed the tickets that Flyr's customer support deals with to help them better focus their efforts. Her work broke down which partners were most likely to create different issues and how expensive they were to deal with. She visualized this data using both Python's Matplotlib and R Shiny.
Our Team: Su Wang
Goal: Efficiently query, analyze and communicate data to answer business questions
Gyft, a wholly-owned subsidiary of First Data Corporation, is a leading digital gift card platform with a top-rated mobile gift card app for iPhone and Android. With a high-traffic websites and massive amounts of data, Gyft needs data analysts to work with cross-functional teams, efficiently query, analyze and communicate data in a fast-paced Internet business environment. A typical day for a Data Analyst intern at Gyft involves creating and maintaining KPI dashboards, working with different departments to understand business questions, coming up with the right metrics, querying the right data efficiently from database, extracting insights using various statistical techniques, and conveying the right message to the teams. SQL is used frequently, R and Python are used when statistical analysis is needed.
Our Team: Vyakhya Sachdeva, Evelyn Peng
Goal 1: Identify frequently visited places for users and predict commutes between those places
Vyakhya and Evelyn worked on 2 major projects at Home.ai. In the first project, they developed production-ready solution to identify frequently visited places based on mobile/GPS location data using DBScan and Gaussian Mixture clustering algorithms. Given the list of places learned in the previous step, they developed an algorithm to identify commutes between these places. They used logistic regression to predict users’ next destination given a departure place and time. Their implementation improved accuracy of existing system by 30%.
Goal 2: Predict states of different home devices based on sensor data from IoT devices, time, and environment factors
Their second project was related to home device automation, in which they combined time-series data from IoT devices (motion sensors, electric outlets, door locks, etc.), environmental data (temperature, time of day, etc.), and user location, and built machine learning models using neural networks to anticipate users’ needs in an autonomous home. Their model achieved overall accuracy rate of ~80%.
Our Team: Eric Lehman
Goal: Develop an automated algorithm to characterize the MLB strike zone for different counts and stances
Eric used both machine learning and analytic approaches to model the actual strike given 2015 and 2016 MLB pitch histories. Several different modeling approaches were investigated using Python's scikit-learn package, including LDA, decision trees and gradient boosting. The strike zone was characterized by finding the best superellipse which minimized the misclassification rate for a given count and batter stance (L or R). Detailed visualizations of the strike zone were created using R's Shiny package.
Our Team: Christine Chu, Erin Chinn
Goal: Predict clinical characteristics of breast cancer patients using mRNA expression levels
Precision medicine is an emerging field which decides medical treatment based on a patient’s genomic content. Given the high dimensionality and complex nature of genes and their interactions, neural networks and deep learning are well suited for predictive models involved in precision medicine. Christine and Erin built a multi-task neural network model in conjunction with a denoising autoencoder to predict clinical characteristics based on the expression levels of breast cancer patients. They used multiple libraries in Python (Numpy, Sci-kit Learn, Keras, Theano) to develop and test their models. With their model, they were able to achieve 93% and 82% accuracy in predicting two important clinical features that distinguish between different types of breast cancer cells.
Our Team: Matt McClelland
Goal 1: Modeling Los Angeles County deed transactions for accurate forecast
The goal of this project was to recreate a time series model provided to LA County by UCLA economist William Yu. The resulting ensemble model uses VAR techniques and Dynamic Regression with numerous autoregressive terms.
The nature of this project required GGplot for the purposes of validation and ease of forecast projection. The actual statistical implementation required various packages combined with custom functions to achieve ensemble forecast results. Further work needs to be done to bring this function in house as a cleaned working product
Goal 2: Cleaning, aggregating and visualizing LA County voting data
As an open-ended exploration into LA County Vote By Mail data, there was much work done to both clean and augment the data. Data augmentation was done using the Google geocoding API and R’s gmap package for lat/lon and distance queries. Voter data was also supplemented by Census data which merged at the census tract level. Using this data, we were able to generate several EDA style reports for LA County’s in-house reference. The visualization was made using Leaflet and Shiny. Currently this is to be an in-house application for LA County, however there is potential to potential to make this public facing.
Our Team: Francisco Calderon Rodriguez
Goal: Build a classification model to predict the likelihood of a customer email being a complaint
Francisco extracted email texts using Python from customer management platform that were formatted in JSON. He collected over 5,000 non-complaint emails and 260 complaint emails and applied Count-Vectorization from Scikit-Learn to generate frequencies for each of the 6,000 features. Modeled features using Logistic Regression with an L1 Penalty to perform feature selection.
Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton
Goal: Understand customer sentiment and demand to gain insight into future product diversity and customer desires
At Loot Crate, a subscription-based "geek and gaming" memorabilia service, understanding what types of products consumers want and how they feel about the subscription offerings currently available is paramount to success. Cameron and Mikaela used Natural Language Processing (NLP), specifically text analyzers and sentiment analysis, to quantify customer experience and explore trends around historical consumer insights. These results were combined with Naive Bayes Classification and exploratory analysis of historical product offerings to help guide future Loot Crate product curation.
Our Team: Spencer Smith
Goal: Build a web interface for clinics to upload EEG data
Data was parsed then saved in a MongoDB. Spencer developed a new algorithm for classifying developmental trajectories of medical data. He has plans to publish this innovation.
Our Team: Connor Ameres, Andre Guimaraes Duarte
Goal: Create an interactive dashboard of Firefox crash statistics
Andre and Connor used Spark (PySpark + SparkSQL) to create an ETL pipeline that generates a tri-weekly (M, W, F) report of crash analysis on a representative 1% sample of the population from Firefox's release channel on desktop. They used Mozilla's MetricsGraphics.js library (D3) in order to produce an interactive visualization of these data.
In addition, Andre performed several ad-hoc analyses of Firefox user behavior data, such as finding a correlation between heavy users and early adopters using logistic regression. Connor also used various clustering methods and anomaly detection algorithms, like Isolation Forests, to segment users based on their corresponding engagement metrics.
Our Team: Joshua Amunrud
Goal: Predict game-level attendance for 2017 season and analyze Stubhub ticket listings over time
Game-level data includes ticket sales, day of week, opponent, promotion type, and more. A linear regression model was fit to this data in order to both predict game-by-game attendance and to quantify the change in factors. This information could be used to cluster the games for the multi-game ticket packs. Another project involved R with ggplot2 to plot ticket listings over time in order to gain an intuition for how prices change relative to gameday.
Our Team: Will Young
Goal: Predict water stress in almond trees using remotely sensed data (e.g. weather, aerial imagery)
Will used established methods in water stress management to engineer features from remote data. These features were used as input to a Gradient Boosting algorithm to predict water stress at the tree level. Other projects included a Kmeans algorithm to measure tree diameter from aerial imagery. He primarily used Python with Numpy, Sklearn, and Pandas.
Our Team: Melanie Palmer
Goal: Model purchase propensity for season ticket holders as well as third party events to increase sales leads
Melanie used Python to build a gradient boosting classifier to identify purchasers from nearly half a million records. The model was integrated into a dynamic server using Redshift and Crontab to update likelihood-to-purchase scores and predictions on a weekly basis. Data was analyzed and modeled using R and Tableau.
Our Team: Tim Zhou
Goal: Investigate the presence of various economic effects within gaming metrics
Tim wrote internal reports and dashboards of various mobile gaming metrics using R Shiny. He performed various data transformations and cleaning to prepare data for modeling.
Our Team: Lin Chen
Goal: Build a data analysis dashboard and perform feature exploration to optimize models
Lin used SQL and R to analyze some phenomenons of mobile game in-app purchase, and designed the visualization dashboard using R Shiny. She used python, R, and QGIS to explore new features from the external data, performed feature engineering and feature selection, and optimized the models. By using new features, the final model achieved a 20% lift on users’ lifetime value prediction.
Our Team: Ruixuan Zhang, Brigit Lawrence-Gomez
Goal: Develop ensemble classification model to classify book chapters and improve user reading experience
In order to ensure maximum reader satisfaction, Scribd is interested in presenting their digital reading content in the optimal way - skipping all the boring stuff at the beginning and ends of the book. In order to correctly tag their vast digital library, they rely on the power and efficiency of machine learning. We successfully applied various machine learning, feature extraction, and natural language processing tool to chapter-level book data, achieving 96% accuracy using Python, Scikit-learn, and GloVe Word Vectors.
Our Team: Sheri Nguyen
Goal: Built a model to predict the amount of days a package will be estimated to be in transit
As a shipping API platform, Shippo aims to make shipping fast, cost-effective and easy to use for its consumers. This project served to give an alternative arrival estimation in addition to the carrier's estimated delivery date. She tested a variety of models including: Random Forest, Linear Regression, Poisson Regression and Gradient Boosting. To evaluate her results, she used K-Fold Cross Validation and error metrics such as MSE and MAE.
Our Team: Jinxin Ma
Goal: Use network analysis to determine the most central customers of the bank
Jinxin used R and Python to perform network analysis using Silicon Valley Bank’s CRM data. The analysis helped determine the most central customers and thus providing the bank information on which customers to further invest in.
Our Team: Juan Pablo Oberhauser
Goal: Build a Data Acquisition pipeline for Fasta RNA sequencing files
The pipeline used tools such as pseudo-alignment and pyspark to download, reshape, and quantify genomic data. Juan Pablo built a tree-based ensemble classification program in Spark to diagnose several diseases and viral infections. He primarily used Spark, Python and AWS.
Our Team: Arda Aysu
Goal: Build a model to predict student scores on external state-mandated exams using internal metrics
After gaining familiarity with the structure of Summit's student performance data, a model was built in Python to predict their test scores. Time was also spent giving insight on data-driven approaches and helping the Summit team learn Python.
Our Team: Roger Wu
Goal: Quantify the impact of pricing and other factors on conversion
Turo is a car rental marketplace where car owners can rent out their cars to travelers. Understanding how price and other factors impact conversion is important for the success of the business. Using R and SQL, Roger quantified the impact of these factors on conversion. This was done using an observational study and a logistic regression model.
Our Team: Yichao Zhu
Goal: Explore the seasonality and regional outbreak of conjunctivitis based on social media posts
Yichao extracted tweets that mentioned conjunctivitis and all the related fields such as replies, location and time using Python, AWS and Twitter API. Social network based methods are implemented to estimate the users' location, which was not provided. She also suggested NLP (content-based method) for location estimation. She created a database to store the datasets for further usage.
Our Team: Vincent Rideout
Goal: Use Deep Learning to study the radiation therapy pretreatment process
Vincent used convolutional neural networks to predict pretreatment passing rates from images representing the radiation dosage applied to each part of the body (fluence maps). He experimented with many different neural net architectures and hyperparameters to optimize the quality of the predictions. Transfer learning was used to improve performance on a small dataset: a model trained to excel at image recognition on the ImageNet real-world image dataset was repurposed for this problem and turned out to be the best solution. The team is about to submit an academic paper based on their findings and will be presenting at the American Association of Physics in Medicine Annual Meeting & Exhibition.
Our Team: Tim Zhou, Zefeng Zhang
Goal: Leveraging utility companies' water usage data to optimize operational effectiveness and revenue sources
Tim developed a hybrid clustering algorithm using KMeans and DBSCAN to help flag potentially anomalous water meters. Tim explored using Fourier analysis to identify periodic behaviour, but eventually settled on a simple autocorrelation algorithm. Zefeng investigated correlations between gas and water usage data. Zefeng developed an anomaly detection model using lag 1 Markov Chains.
Our Team: Lawrence Barrett
Goal 1: Identify outliers in self-recorded weight data to clean datasets
Lawrence used R and a distance-based outlier detection algorithm to identify outliers in weight data. The algorithm came from an academic paper, but it was modified to work on the weight data that Vida provided.
Goal 2: Evaluate performance of a new coach matching algorithm via A/B testing
Lawrence used A/B testing in R to confirm that the new coach matching algorithm significantly improved communication between coaches and users. He also evaluated the power of the test to determine the reliability of these results.
Goal 3: Pulling data from Vida's database for company reports
Lawrence used SQL in BigQuery, Google's relational database, to pull data for reports that helped determine the effectiveness of Vida's weight loss, diabetes prevention, and blood pressure monitoring programs. This was an ongoing need throughout most of the internship since reports needed to be updated and turned in on a monthly basis. Lawrence used a combination of SQL and R to query and clean the data for these reports.
Goal 4: Testing the Vida ChatBot
Lawrence implemented many tests on functions in production that are used to understand user text input by grabbing the important words needed to respond to the user's queries. He used Python's unit testing framework to test these functions.
Our Team: Donny Chen
Goal 1: Develop recommendation systems to personalize users’ healthcare reading
Donny performed NLP topic modeling and applied information retrieval on large documents and users’ messages to engineer features. He developed recommendation systems including collaborative filtering in Python to personalize healthcare readings upon a keywords search.
Goal 2: Data pull from Google BigQuery
Donny used SQL to retrieve and aggregate data from Google BigQuery to get the data from KPI to users’ metrics. He also used R to cleanse and visualize the data for various analytical reports.
Goal 3: Design and migrate from Google BigQuery RDBMS to Neo4j graph database
Donny contributed to designing schema and importing data for healthcare readings in Neo4j graph database to help transit from a traditional relational database to an advanced and scalable graph database. Retrieving connected information in a relational database is often cumbersome since it requires lots of tables joining which can be achieved much more efficiently in a graph database.
Goal 4: Engineered and deployed webserver event tracking of changes of users in Django
Donny used Django to automate events logging to keep track of changes in users’ information. He worked in Python and contributed to the web server in production.
Our Team: Shivakanth Thudi, Danny Suh, Matthew Wang, Jennifer Zhu
Goal: Improve the revenue share model
Vungle’s goal is to evaluate the feasibility and profitability impact from engaging in a revenue share business model with advertisers. Matthew and Jennifer constructed a data extraction and processing pipeline using Python and Spark and built an interactive dashboard as well as a Machine Learning model to enable the sales team to explore, visualize and predict profitable segments of the market. Danny and Shiva improved Vungle’s lifetime value (LTV) models through the use of libraries such as XGBoost, SKlearn, and Spark ML.
Our Team: Albert Ma
Goal: Implement recommendation system using collaborative filtering on users and wikis
Wikia has recommended wikis at the bottom of their pages selected from a custom list. Albert wanted to generate recommendations using collaborative filtering instead of manually picking which wikis to be shown to users. He implemented a recommendation system using alternating least squares and matrix factorization in Python using packages numpy and pandas. He generated baseline models with random guess and recommending popular wikis to evaluate model performance. He was able to reduce miss rate by 25% and 20% compared to random guess and popular wikis respectively.
Our Team: Maxine Qian, Zainab Danish
Goal 1: Build machine learning models to predict prospects of Open Kitchen products
Goal 2: Incorporate a new variable ‘Clumpiness' and gauge whether it is a valuable addition to the RFM framework through modeling
Maxine and Zainab worked with the analytics team to generate consumer insights for a new brand and built models to predict the likelihood of a user purchasing the new brand. They extracted features using SQL, conducted exploratory data analysis in R, and performed feature engineering and built classification models using Python. The model was used to decide which customers to target for the new brand. Also, they used machine learning methods to gauge whether ‘Clumpiness’ variable it is a valuable addition to the RFM framework.
Our Team: Valentin Vrzheshch
Goal: Build a tool to benchmark the company’s trading performance
Valentin wrote Python scripts that compute indicators of high-frequency trading costs for each order (metrics such as implementation shortfall, toxicity and market impact) using pandas and psql. Valentin also developed dashboards with visualizations of the indicators using R (shiny, plotly, googlevis) for easy and fast assessment of algorithm’s performance.
Our Team: April Liu
Goal 1: Build a fraud detection model
April evaluated lasso logistic regression, random forest, AdaBoost/gradient boosting and built a final model that improved baseline F0.5 score by 35%. She evaluated various performance metrics based on the company’s business model and performed correlation analyses and data transformations.
Goal 2: Pipeline for feature importance analysis
April designed and built different Cassandra DBs to house 100 GBs of data. She imputed missing signals in data used to train risk profile models and implemented hashing method to impute signals on a rolling basis. She also assessed feature importances via random forest classifier and extracted over 100 signals from over 200 GBs of data to identify fraud using Python.
Goal 3: Use one set of signals to replace existing expedite rule to achieve higher performance in filtering out high risk transactions while maintaining high proportion of the expedited transactions
April performed advanced analytics and exploratory data analysis in two different payment source signals to come up a solution to improve the current payment expedite rule.
Our Team: Alice Zhao
Goal: Build fraud detection models
Alice worked on building fraud detection models for Non-Sufficient Fund Fraud which is a difficult problem due to highly imbalanced dataset. Alice tried different sampling methods, cost-sensitive algorithms as well as hashing tricks to build a good model using Python, R and SQL.
Our Team: Jinxin Ma, Alice Zhao
Goal: Compare feature importance measures and build a better fraud detection model
Jinxin and Alice worked under the Risk Data team at Xoom. They compared feature importance measures from random forest, gradient boosting, and extra trees and predicted whether transaction is fraudulent using a logistic regression using Python, R, and Tableau.
Our Team: Jinxin Ma
Goal: Impute missing values for transactions and re-evaluated fraud detection rules
Jinxin created his own database using PostgreSQL and wrote efficient queries to impute missing values for millions of transactions. The imputation improved the performance of fraud detection rules.
Our Team: Ben Miroglio and Chhavi Choudhry
Goal: cluster web sessions to segment users and improve the flow of Airbnb's website and mobile app
Ben and Chhavi employed machine learning techniques to identify features indicative of positive outcomes using R and Python. They built interactive web session visualizer using D3.js to identify key differences among different segments of users and to identify bottlenecks in the session journey.
Our Team: Paul Thompson and Jacob Pollard
Goal: cluster users based on factors influencing event preferences and classify event category based on the event description
Paul and Jake applied the ROCK hierarchical clustering algorithm to event data in Python and clustered users based on the events they attended. They also implemented a Naive Bayes classifier using only base Python data structures, employing 5-fold cross validation, resulting in 75% mean accuracy.
Our Team: Vincent Pham and Brynne Lycette
Goal: employ machine learning techniques for credit card fraud detection and build a data unification platform
Capital One's fraud team has collected and built more than two hundred features relevant to classifying fraudulent credit card transactions. Vincent and Bree employed various machine learning techniques using H2O and Dato in order to evaluate software robustness and increase accuracy of fraud prediction. They also implemented a NoSQL data store and a higher level in-memory storage system to unify various streaming and batch processes.
Our Team: Ghizlaine Bennani and Mrunmayee Bhagwat
Goal: cluster similar YouTube channels
Ghizlaine and Mrun developed an algorithm using supervised and unsupervised machine learning techniques using Python and PostgreSQL to cluster similar YouTube channels and videos based on performance and content metrics. This clustering resulted in personalized targeting for multi-channel YouTube content providers. They also modeled median views for individual YouTube videos in their first week using regression analysis.
Our Team: Isabelle Litton
Goal: extract sites of cancer recurrence in clinical notes
Isabelle leveraged natural language processing using Linguamatics to capture cancer recurrence sites from clinical and radiology notes. She also automated the results-validation process with Python, saving approximately two hours per validation.
Our Team: Tate Campbell and Sharon Wang
Goal: model consumer churn for a product loyalty program that identifies key features contributing to customer retention rates
Tate and Sharon used PySpark to extract relevant data and perform feature engineering on more than 10 GB of data. A random forest classification model was built using Python to predict the amount of time consumers would stay actively enrolled in the program.
Our Team: Miao Lu
Goal: help Dictionary.com understand super-user behavior
Our Team: Meg Ellis and Jack Norman
Goal: create a price-suggestion model to assist event organizers in optimizing ticket sales and revenue
Identifying important features that most influence ticket prices, Meg and Jack implemented a K Nearest Neighbors model that clusters events with similar characteristics, and subsequently leveraged the distribution of costs of these similar, successful events to suggest an appropriate range of ticket prices that the organizer can use when creating their event. Flask was subsequently used to create a web application to allow users to interact with the model.
Our Team: Piyush Bhargava, Felipe Chamma and Harry O'Reilly
Goal: optimize liquidity allocation
Felipe and Harry used SQL Server Piyush to developed a new process to optimize cash allocation for the liquidity buffer by leveraging client transaction data and end of day positions. The process was automated, now running on a daily basis to improve performance. An SQL-based tool was also designed to detect unusual customer transaction patterns and alert bank representatives to mitigate liquidity risks. The loan-level loss allocation process was also automated to support capital stress testing required by new financial regulations.
Our Team: Matthew Leach
Goal: investigate customers who claim FLYR's FareKeep product allows customers to lock in a flight price for up to 7 days
Matthew investigated how to better identify which customers were more likely to make a FareKeep claim using clustering techniques to group customer segments, and employing logistic regressions to predict claim rate. He additionally used Bokeh to create a dashboard displaying factors which influence the claim rate.
Our team: Ghizlaine Bennani
Goal: build a dashboard to visualize JUVO data in an interactive fashion
Built a dashboard using Flask app through python to render an interactive D3 visualization dashboard that synchronizes real time data from redshift. The purpose of the dashboard is to help customers understand JUVO business and visualize data to help business unit in their decision making process.
Our Team: David Wen, Jaime Pastor
Goal: create a dashboard to analyze user retention and build customer credit profiles
Our Team: Tate Campbell, Sharon Wang
Goal: build a machine learning model to predict inquiry frequency to optimize bid prices for Good AdWords
Tate and Sharon used k-means clustering to clusters hours of the day which have similar numbers of impressions and clicks, then employed a random forest to quantify the relationship between cost and inquiries. They also built an optimization algorithm to search for the best combination of daily bid prices based on the predicted number of inquiries and budget constraints.
Our Team: Gabrielle Corbett and Jason Helgren
Goal: validate the utility of an app-based street sweeping alert
Metromile provides an app with convenience features including fuel economy analysis, engine monitoring, and street sweeping alerts. Gabby and Jason developed a logistic regression model using Python and PostgreSQL to predict driver behavior based on past driving habits and whether the driver received a street sweeping alert.
Our Team: Mrunmayee H. Bhagwat
Goal: build a time series forecasting model
Using NLP techniques Mrunmayee categorized Walmart’s unstructured social media data and modeled their social buzz using a generalized linear model. She also performed time series analysis on the data to identify seasonal fluctuations and trend in Walmart’s quarterly revenue and built a revenue forecast model using a SARIMA approach in R.
Our Team: Kirk Hunter
Goal: improve mobile targeting and optimize bid pricing in the programmatic advertising space
Programmatic advertising takes place on desktop and mobile devices and the space can yield billions of data points per day. Kirk interacted with extremely large datasets using the distributed computing framework Apache Hive. He built machine learning models using Python’s scikit-learn library that identified mobile users who are most likely to lead to a successful outcome for the company.
Our Team: Alex Morris
Goal: develop a structured method by which to extract data from SEC EDGAR filings
Alex assisted in the development of SecParser Python package and implemented a data pipeline which extracts key data for analysis from unstructured SEC EDGAR form filings (Form 4). He subsequently carried out various machine learning and regression techniques on the parsed data to identify which filings have a significant impact on the issuers stock performance following large insider transactions.
Our Team: Jaclyn Nguyen
Goal: identify achievement gaps and calibrate grading system
Jaclyn used national MAP assessments and statistical analysis to confirm that students met their projected assessment growth independent of race and socioeconomic group. She additionally conducted an analysis on whether teachers graded consistently across grade levels and subjects, and provided a regression-based recalibration technique.
Our Team: Alex Romriell and Swetha Venkata Reddy
Goal: create a model to detect conjunctivitis outbreaks
Alex and Swetha performed text and geo-spatial analysis on over 300,000+ tweets to detect local pinkeye outbreaks. They created a framework for identifying tweets directly related to conjunctivitis. Surges in outbreaks were mapped to clinical records nationwide. Time series analysis of the tweets revealed similar trends and seasonality compared to the actual hospital data. Text analysis techniques such as like Latent Symantec analysis on AWS were employed to filter noise from the data. A multinomial Naive Bayes model was also developed based on TFIDF scores of the tweets to predict the sentiment.
Our Team: Jacob Pollard
Goal: identify potential donors
Using Random Forest models in R, Jacob selected 10 among 70 total variables in the USF alumni donor database that had the strongest influence on predicting a potential donor. These variables were employed in building an ensemble of logistic regression classifiers with bagging. This method was subsequently implemented in Python with the help of the Scikit-Learn, Numpy, and Pandas.
Our Team: Paul Thompson
Goal: improve predictions of freelancer job performance ratings
Paul created classification models using LSTM and GRU Neural Networks, Word2Vec and Doc2Vec, and TF/IDF in conjunction with machine learning algorithms such as random forest and support vector machines. He used python scikit-learn, gensim, and keras and ran models both locally and on AWS EC2 clusters (using CloudML) and EC2 GPU (using ssh), succeeding in improving upon the existing production model.
Our Team: Ryan Speed
Goal: develop a framework to support processing and analysis of NBA player tracking data
Ryan built a MapReduce framework in Python to support creation of descriptive and predictive models, and generate a set of analytics output for SportVu NBA Player Tracking Data which generates 800,000 data points per game. Clustering an similarity scoring was conducted on player location distributions, court spacing, and distance traveled across multiple 55GB datasets.
Our Team: Alex Romriell and Swetha Venkata Reddy
Goal: determine the efficacy of specialized OcculusRift eye treatment software
Alex and Swetha optimized the ETL (extract, transform, load) processes using Python and PostgreSQL to better ingest user game-play data. SQL was used to query the dataset and retrieve key eye metrics from the game logs. They confirmed that the treatment correctly targeted the weak eye. They also created a D3.js dashboard to visual patterns and behaviors of users’ gaming sessions.
Our Team: Chhavi Choudhury, Yikai Wang and Wanyan Xie
Goal: develop a User Conversion Rate Model and a User Lifetime Value Model as a mobile app advertising company
Vungle built an ad recommendation system based on a user-conversion rate prediction model with ad view-install data. Wanyan implemented a factorization machine model to quickly calculate the weights of interaction terms in logistic regression model and Yikai ran a test-performance-based feature selection algorithm to select features in current model and also implemented a gradient boosting tree model. Using Python and Spark, they were able to improve both the efficiency and accuracy of predictions. In an attempt to build user lifetime value (LTV)-related models, Chhavi, Yikai and Wanyan identified germane user-related features and developed various models to predict active user days and 7-day revenue across different advertisers.
Our Team: Isabelle Litton
Goal: automate content tagging process to improve ad targeting
Isabelle trained multiple classifiers using Python’s Sklearn package and tfidf features to achieve an overall accuracy 86%.
Our Team: Jaclyn Nguyen, Sakshi Bhargava, and Henry Tom
Goal: develop an automatic image selection and image tagging algorithm
Williams-Sonoma’s product feed contains a tremendous amount of image data, and it is difficult to automatically tag images for both analysis and product recommendations. Jaclyn, Sakshi and Henry successfully automated this process through the use of custom built ensemble image processing algorithms, superpixels, and other recent imaging advances, achieving an accuracy of 99% between silhouette/product images. They also developed a color prediction and color labeling algorithm with 90% accuracy.
Our Team: Erica Lee and Binjie Lai
Goal: improve and test dynamic pricing strategies
Wiser provides pricing strategies for e-commerce retailers to optimize revenue. Binjie applied tuned ridge regression and time series models for prediction, designing experiments and testing results, improving upon the current prediction model by 15%. Erica employed a linear mixed effects model to measure the effectiveness of the dynamic pricing engine, using technologies which included Python, Spark, and PostgresSQL, as well as working on big data platforms Amazon Redshift and Databricks.
Our Team: Felipe Formenti Ferreira
Goal: enhance Womply's Statboard metrics, evaluate customer churn, and identify high-value clients
Using client's daily revenue data, Felipe used python to parse the transaction history and provide business insights such as growth rate and average transaction size. The Statboard is now live. Felipe also used principal component analysis to eliminate any correlation between predictor variables and identify influential factors contributing to the customer retention.
Our Team: Brian Kui and Tunc Yilmaz
Goal: implement generalized linear models and neural network models to improve existing load forecasting models
AutoGrid helps industrial customers shed power by controlling the operation power consuming devices such as water heaters. The team evaluated modifications to the forecasting models proposed by the data science team in order to help AutoGrid decide whether it is feasible to incorporate the modifications in the production code. They analyzed signals received, load, and the state of the water heaters, and identified errors in operation.
Our Team: Cody Wild
Goal: provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for product's subscribers
Cody utilized clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that utilizes patterns of word reoccurrence to identify nearest neighbors in content space.
Our Team: Kailey Hoo, Griffin Okamoto, and Ken Simonds
Goal: mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R
The team quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.
Our Team: Matt Shadish
Goal: apply machine learning techniques to improve an existing trading strategy
Matt used Python and pandas to incorporate external variables and build cross-sectional models to a time series problem. He also created visualizations of current trading strategy performance using ggplot2 in R.
Our Team: Brian Kui and Tunc Yilmaz
Goal: query time-series printer data that is highly unbalanced: less than 200 faults within two million time records
Brian and Tunc applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.
Our Team: Alice Benziger
Goal: create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data, such as page views (on mobile and desktop applications), email click-through rates, and social media (Facebook, Instagram, and Twitter) interactions
Alice applied machine learning techniques to implement a model to predict the popularity score of new words to optimize user engagement.
Our Team: Matt Shadish
Goal: perform analysis of historical retail product prices across stores
Using Python Matt created visualizations of these analyses in Matplotlib. He then applied the analysis as a functional solution (using RDD’s and DataFrames) so as to take advantage of Apache Spark. This enabled an analysis of billions of price history records in a reasonable amount of time.
Our Team: Steven Chu
Goal: define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics
As Fandor is a subscription-based model, their focus is to bring in more subscribers and retain current subscribers. There is a lot of potential to use metrics to segment as well as run predictions for users. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to see and utilize in their own decision-making processes.
Our Team: Florian Burgos and Dan Loman
Goal: use machine learning to predict the price of connecting flights based on the price of the one-way tickets
Florian and Dan improved user engagement on the website by displaying content on the landing page with d3. They also computed content overnight using distributed computing on an AWS ec2 instance to find the best deals in the U.S. by origin.
Our Team: Chandrashekar Konda
Goal: solve parts normalization and payment terms mapping tasks
Using Hadoop and Elastic search, Chandrashekar identified similar mechanical parts out of five million parts in oil rig design versions for GE Oil & Gas business.
In a separate project Chandrashekar using Python and Talend, we identified the best payment terms from one million payment terms across GE’s different businesses.
Sourcing: Using Python, we compared over 1.8 million purchase transactions with 50,000 of GE’s products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries.
Our Team: Sandeep Vanga
Goal: perform unsupervised text clustering to gain insights into representative sub-topics
Sandeep built a baseline model using Kmeans clustering and tfidf features. He also devised two variants of Word2Vec (deep learning-based features) models. The first method is based on aggregation of word vectors and the second method is based on Bag of Clusters (BoClu) of words. He also implemented elbow method to choose optimal number of clusters. These algorithms are validated on 10 different brands/topics using the news data collected over one year. Various quantitative metrics such as entropy, silhouette, score, etc. and visualization techniques were used to validate the algorithms.
Our Team: Brendan Herger
Goal: study existing data stream to drive business decisions, and optimized data extract-transform-load process to enable future insightful real-time data analysis
Though Lawfty’s existing pipeline had substantial outage periods and largely unvalidated data, Brendan was able to support creating a new a Spanish language vertical, creating near-real-time facilities, and contribute to better targeting AdWords campaigns.
Our Team: Fletcher Stump Smith
Goal: perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (Scipy).
Fletcher wrote code to store and process text data, using Python and SQLite. He performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.
Our Team: Michaela Hull
Goal: find duplicate voters using exact and fuzzy matching, feature engineering such as distances between two points of interest, trolling the Census Bureau website for potentially useful demographic features, and classification models, all in the name of poll worker prediction
Michaela employed the use of distributed computing, the Google Maps API, relational databases, dealing with large databases (~5 million observations), and a variety of machine learning techniques.
Our Team: Layla Martin and Patrick Howell
Goal: develop a machine learning model to predict a flavor label for every food in MyFitnessPal’s database
Using primarily Python and SQL, the team built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within UnderArmour.
Our Team: Leighton Dong
Goal: build consumer credit default risk models to support clients in managing investment portfolios
Leighton prototyped a methodology to measure default risk using survival analysis and a cox proportional hazard model. He developed an automated process to comprehensively collect company information using Crunch Base API and store them in a NoSQL database. Leighton also engineered datasets to discover potential clients for analytics products (such as retail pricing optimization) and collected company names and other text features from Bing search result pages automatically.
Our Team: Brendan Herger
Goal: build out multiple data pipelines and utilize machine learning to help drive REVUP's beta product
Brendan was able to create three new data streams which were directly put into production. Furthermore, he utilized natural language processing and machine learning to validate and parse mechanical turk output. Finally, he utilized spectral clustering to identify individual’s political affiliation from Federal Elections Commission data.
Our Team: Rashmi Laddha
Goal: build a predictive model for revenue forecasting based on stylist’s cohort behavior
Rashmi clustered stylists’ micro-segments by analyzing their behavior in initial days of joining the company and used k-means clustering on three parameters to cluster stylists. She then built a forecast model for each micro-segment in R using HoltWinters filtering and ARIMA, tuning the model to get an error rate within 5%. She also performed sensitivity analyses around changing early performance drivers in stylist’s life cycle.
Our Team: Griffin Okamoto and Scott Kellert
Goal: demonstrate the efficacy of online content assessments.
Griffin and Scott demonstrated the efficacy of Summit's online content assessments by using student scores on the assessments and demographic information to predict standardized test scores. They developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.
Our Team: David Reilly
Goal: examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R
David constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. He analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.
Our Team: Layla Martin and Leighton Dong
Goal: analyze influential factors in USF undergraduate student retention using logistic regression models
The team predicted students' decisions to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.
Our Team: Sandeep Vanga and Rachan Bassi
Goal: automate the process of image tagging by employing image processing as well as machine learning tools
Williams-Sonoma’s product feed contains more than a million images and the corresponding meta data — such as color, pattern, type of image (catalog/multiproduct/single-product) — is extremely important to optimize the search and product recommendations. They automated the process of image tagging by employing image processing as well as machine learning tools. They used image saliency and color histogram-based computer vision techniques to segment and identify important regions/features of an image. A decision tree-based machine learning algorithm was used to classify the images. They were able to achieve 90% accuracy in case of silhouette/single-product images and 70% accuracy in case of complex multiproduct/catalog images.
Our Team: Luba Gloukhova
Goal: quantify the performance of an underlying high frequency trading strategy
Luba expanded existing internal database with data sources from Bloomberg Terminal, enabling deeper understanding of symbol characteristics underlying strategy performance. She also identified discrepancies in an end-of-day trading analysis database.
Our Team: Daniel Kuo
Goal: develop a supervised machine learning algorithm for a Publication Authorship Linkage project to determine whether multiple publications are co-referred to the same authors
Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily augment new attributes and features into models. The modeling techniques used in this project include logistic regression, decision trees, and adaboost. The team used the first two algorithms to perform feature selections and then used the adaboost to improve performance.
Our Team: Monica Meyer and Jeff Baker
Goal: Develop a classification algorithm/model for the Disease Area Relevancy project that would predict and score how related a given document was to a specified disease area.
The model provides Zephyr the ability to quickly score and collect documents, as they relate to a disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The approach is scalable to include other document types.
Our Team: WeiWei Zhang
Goal: Determine disease area relevancy for medical journals using machine learning techniques.
The project began with data sampling from the PubMed database. Through natural language processing and feature engineering process, the text of abstract and title of medical documents were transformed into tokens with TF-IDF (Term Frequency, Inverse Document Frequency) scores. By leveraging the characteristics of a random forest classifier, the most important features from the feature space were selected. The body of the model was a multi-label logistic regression. The results were evaluated based on the accuracy, recall, precision, and F1 score. In short, the project is a great example of handling unlabeled data, imbalanced classes, and multi-label problems in a machine learning context.
Kirsten Keihl, Assistant Director