Past Projects

The MS in Data Science (MSDS) sends internship teams to companies tackling data science problems in industries ranging from high-frequency trading and hospitality to energy efficiency and transportation.

Our partner organizations range from small start-ups to established Bay Area technology firms as well as large civic and nonprofit organizations.



Our Team: Ben Miroglio and Chhavi Choudhry

Goal: Cluster web sessions to segment users and improve the flow of Airbnb's website and mobile app.

Ben and Chhavi used R and Python to apply machine learning techniques that identify features indicative of positive outcomes. They built an interactive web session visualizer in D3.js to highlight key differences among user segments and to locate bottlenecks in the session journey.

Capital One Labs

Our Team: Vincent Pham and Brynne Lycette

Goal: Employ machine learning techniques for credit card fraud detection and build a data unification platform.

Capital One's fraud team has collected and built more than two hundred features relevant to classifying fraudulent credit card transactions. Vincent and Brynne applied various machine learning techniques using H2O and Dato to evaluate software robustness and increase the accuracy of fraud prediction. They also implemented a NoSQL data store and a higher-level in-memory storage system to unify various streaming and batch processes.


Our Team: Meg Ellis and Jack Norman

Goal: Create a price-suggestion model to assist event organizers in optimizing ticket sales and revenue.

After identifying the features that most influence ticket prices, Meg and Jack implemented a k-nearest neighbors model that identifies events with similar characteristics. They then used the distribution of ticket costs across these similar, successful events to suggest an appropriate price range that organizers can use when creating an event. Finally, they built a Flask web application that lets users interact with the model.
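The nearest-neighbor idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the function name suggest_price_range, the feature vectors, and the choice of the interquartile range as the suggested band are all assumptions made for the example, and the features are assumed to be already scaled to comparable units.

```python
import math
from statistics import quantiles

def suggest_price_range(new_event, past_events, k=3):
    """Suggest a (low, high) ticket price from the k most similar past events.

    new_event   -- tuple of numeric features (hypothetical, pre-scaled)
    past_events -- list of dicts with "features" and "price" keys
    """
    # Rank past events by Euclidean distance to the new event, keep the k closest.
    nearest = sorted(
        past_events,
        key=lambda event: math.dist(new_event, event["features"]),
    )[:k]
    prices = sorted(event["price"] for event in nearest)
    # Use the first and third quartiles of the neighbors' prices as the range.
    q1, _median, q3 = quantiles(prices, n=4, method="inclusive")
    return q1, q3
```

Returning a range rather than a single number mirrors the description above, where the organizer receives a band of reasonable prices drawn from comparable events.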


Our Team: Sandeep Vanga

Goal: Perform unsupervised text clustering to gain insights into representative sub-topics.

Sandeep built a baseline model using k-means clustering and TF-IDF features. He also devised two models based on Word2Vec (deep learning-based) features: the first aggregates word vectors, and the second uses a Bag of Clusters (BoClu) of words. He implemented the elbow method to choose the optimal number of clusters. These algorithms were validated on 10 different brands/topics using news data collected over one year. Quantitative metrics such as entropy and silhouette score, along with visualization techniques, were used to validate the algorithms.
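As a rough illustration of the elbow method mentioned above, the sketch below runs a plain k-means for several values of k and records the within-cluster sum of squares (inertia) for each. It is a toy version under stated assumptions: the points stand in for TF-IDF or Word2Vec document vectors, the first k points seed the centroids so the run is deterministic, and the names kmeans and inertia_curve are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (centroids, inertia). Seeds with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total squared distance from each point to its closest centroid.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

def inertia_curve(points, max_k):
    """Inertia for k = 1..max_k; the 'elbow' is where the curve stops dropping sharply."""
    return [kmeans(points, k)[1] for k in range(1, max_k + 1)]
```

With two well-separated groups of points, the curve falls steeply from k = 1 to k = 2 and then flattens, which is the visual cue the elbow method relies on.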


Our Team: David Reilly

Goal: Examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R.

David constructed behavioral and situational features to model driver responses to dispatch requests using advanced machine learning algorithms. Using Python, he also analyzed cancellation fee refund rates across multiple cities to predict when a cancellation fee should be applied.


Our Team: Sandeep Vanga and Rachan Bassi

Goal: Automate the process of image tagging by employing image processing as well as machine learning tools.

Williams-Sonoma's product feed contains more than a million images, and the corresponding metadata (such as color, pattern, and type of image: catalog, multiproduct, or single-product) is extremely important for optimizing search and product recommendations. Sandeep and Rachan used image saliency and color histogram-based computer vision techniques to segment and identify important regions and features of an image. A decision tree-based machine learning algorithm was then used to classify the images. They achieved 90% accuracy on silhouette/single-product images and 70% accuracy on complex multiproduct/catalog images.

Nonprofit and Civic Organizations

Los Angeles County

Our Team: Michaela Hull

Goal: Support poll worker prediction by finding duplicate voters with exact and fuzzy matching, engineering features such as distances between two points of interest, trawling the Census Bureau website for potentially useful demographic features, and building classification models.

Michaela used distributed computing, the Google Maps API, relational databases, and a variety of machine learning techniques while working with large databases (~5 million observations).
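The combination of exact and fuzzy matching described above might look something like the following sketch, which uses Python's standard difflib module. The record fields (id, name, dob, address), the 0.9 similarity threshold, and the sample records are all made up for illustration and do not reflect the county's actual data or method.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(record):
    """Build a comparison key: trimmed, lower-cased name, birth date, and address."""
    return " ".join(str(record[f]).strip().lower() for f in ("name", "dob", "address"))

def find_duplicates(records, threshold=0.9):
    """Return (id_a, id_b, kind) for record pairs that match exactly or fuzzily."""
    pairs = []
    for a, b in combinations(records, 2):
        key_a, key_b = normalize(a), normalize(b)
        if key_a == key_b:
            pairs.append((a["id"], b["id"], "exact"))
        elif SequenceMatcher(None, key_a, key_b).ratio() >= threshold:
            pairs.append((a["id"], b["id"], "fuzzy"))
    return pairs

# Hypothetical sample records, invented for this example.
records = [
    {"id": 1, "name": "Maria Lopez", "dob": "1980-02-14", "address": "12 Oak St"},
    {"id": 2, "name": " maria lopez ", "dob": "1980-02-14", "address": "12 Oak St"},
    {"id": 3, "name": "Maria Lopes", "dob": "1980-02-14", "address": "12 Oak St"},
    {"id": 4, "name": "John Smith", "dob": "1975-07-01", "address": "99 Pine Ave"},
]
```

Comparing every pair is quadratic in the number of records, so at the scale of millions of observations a real pipeline would first group candidates (for example by birth date or ZIP code) and only compare within each group.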

Summit Public Schools

Our Team: Griffin Okamoto and Scott Kellert

Goal: Demonstrate the efficacy of Summit's online content assessments by using student scores on these assessments and demographic information to predict their standardized test scores.

Griffin and Scott developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.
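For readers unfamiliar with linear regression, the sketch below fits a least-squares line with a single predictor in plain Python. It is deliberately simplified: the project used R and also included demographic information, whereas this example uses one made-up predictor (an online assessment score) and made-up standardized test scores, and the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented numbers for illustration only.
assessment_scores = [60, 70, 80, 90]
test_scores = [220, 240, 260, 280]
slope, intercept = fit_line(assessment_scores, test_scores)
predicted = slope * 75 + intercept  # predicted test score for an assessment score of 75
```

A positive, well-supported slope is the kind of evidence that would suggest assessment scores carry information about standardized test performance.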


Our Team: Alex Romriell and Swetha Venkata Reddy

Goal: Create a model to detect conjunctivitis outbreaks.

Alex and Swetha performed text and geospatial analysis on more than 300,000 tweets to detect local pinkeye outbreaks. They created a framework for identifying tweets directly related to conjunctivitis. Surges in outbreaks were mapped to clinical records nationwide. Time series analysis of the tweets revealed trends and seasonality similar to those in the actual hospital data. Text analysis techniques such as latent semantic analysis, run on AWS, were employed to filter noise from the data. A multinomial Naive Bayes model based on the TF-IDF scores of the tweets was also developed to predict sentiment.
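To give a sense of how a multinomial Naive Bayes text classifier works, here is a small self-contained sketch. It departs from the project in ways worth stating plainly: it uses raw word counts with Laplace smoothing rather than TF-IDF scores, the training tweets are invented, and the function names train_nb and predict_nb are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed per-word log likelihoods."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, counts in word_counts.items():
        # Laplace smoothing: add alpha to every word count in the vocabulary.
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "prior": math.log(class_docs[label] / len(docs)),
            "likelihood": {w: math.log((counts[w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for word in text.lower().split():
            if word in params["likelihood"]:
                score += params["likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training tweets, for illustration only.
tweets = [
    "pink eye is awful and itchy",
    "my eyes hurt so bad",
    "feeling great today",
    "eyes finally feel great again",
]
sentiments = ["negative", "negative", "positive", "positive"]
model = train_nb(tweets, sentiments)
```

Swapping the raw counts for TF-IDF weights changes the feature values fed to the model but not the overall structure of training and prediction.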