Data Science MS
Kirsten Keihl, Assistant Director
All students gain real world experience for nine months of the program (15 hours/week) tackling data science and analytics problems at companies around the San Francisco Bay Area and beyond. Each year, roughly 60 companies come to pitch their projects to our students during “Pitch Day” and each student chooses their top 15 projects and companies (past and current partners include those shown below). Students with complementary strengths are matched up to form a team, such as students with statistics, economics, and computer science backgrounds. To ensure success of the practicum projects, each team is actively mentored by a faculty member who participates in weekly meetings to supervise and provide technical and mathematical expertise.
Following an initial hypothesis, students typically engage in data acquisition, exploratory data analysis, feature extraction, model development and evaluation, as well as oral and written communication of results. Class schedules are set so that students can work onsite one to two days per week.
A select list:
Our Team: Daren Ma, Ming-Chuan Tsai, Haree Srinivasan
Goal: Students at ABC News used Python to write a machine learning model to predict election results and used Docker and AWS to deploy the pipeline.
Our Team: Jacob Goffin
Goal: At Accountability Counsel, Jacob created web-scraping scripts in Python & Selenium to build a first-of-its-kind database of human rights complaints. He also built a document-search (using Django/ElasticSearch) on thousands of .pdf documents, allowing users to quickly find relevant human rights cases to support their research.
Our Team: Ivette Sulca, Hoda Noorian
Goal: Students at Airbnb developed an evaluation tool prototype that identifies socioeconomic bias on Airbnb algorithms and experiments. They analyzed past A/B tests and built a dashboard using Python and Superset.
Our Team: Esther Liu, Jack Dong
Goal: At Beam Solutions, students used machine learning techniques to classify transaction data and perform text clustering. They also worked on industry research and database mapping for potential new customers.
Our Team: Hannah Lyon
Goal: At Cuyana, Hannah used Markov chains to develop a data-driven marketing attribution model that informed marketing spend. She created a customer propensity model using gradient boosting to determine critical site features that were then enhanced by the digital team to improve conversion. Additionally, she combined SQL and Tableau data for ad-hoc analysis of payment methods, trained neural networks to produce product embeddings used for a recommendation system on website product pages, and modeled repeat purchaser behavior predicting second purchases.
Our Team: Maxine Liu, Zhentao Hou
Goal: Students at Eventbrite built a classifier and a deep learning model to improve event recommendations. They also researched cases for and against investing in online events from the perspectives of opportunity size, product data, and potential revenue impact. On another project, they analyzed text data with NLP libraries to identify features that are indicative of event listing quality.
Our Team: Kevin Wong
Goal: At Faire, Kevin developed a SQL-based outlier flagging mechanism. Additionally, he conducted a deep-dive analysis of the effectiveness of the Faire mobile app on retailer behavior using SQL, Python, statistics, and propensity-score matching.
Our Team: Peng Liu, Wenjie Duan
Goal: Students at FLYR developed a SQL/Python workflow that predicted flight revenue by finding similar flights with clustering and Random Forest models.
Our Team: Vivian Chu
Goal: Vivian worked with FracTracker on the collection and aggregation of oil and gas data for the state of California, before conducting production analysis of oil wells at the pool level. Financial data was then added to predict the status of each of the oil wells as an asset or liability.
Our Team: Kyrill Rekun, Xueying Li
Goal: At the Golden State Warriors, students used machine learning techniques to create a last-minute ticket buyer model that predicts the probability of a person being a last-minute, planner, or in-between buyer. Using the lifetimes Python package, they built a proxy lifetime value spend model for customers to aid in marketing and ticket targeting. These projects utilized tools such as Pandas, Seaborn, and sklearn.
Our Team: Peng Liu, Wenjie Duan
Goal: Students at Gore Medical developed PyTorch CNN models using the fast.ai API to detect key points in medical optical coherence tomography images, thus allowing for automated assessment of an implant. They achieved these results using transfer learning and data augmentation.
Our Team: Ariana Moncada, Matthew Sarmiento
Goal: At Hohonu at the University of Hawaii, students created a tidal forecasting pipeline that helps populate a Django web application and Plotly plots for forecasts. They clustered multiple time series datasets together to increase the performance of their multivariate time series models in R and Python.
Our Team: Bing Wang
Goal: At the Human Rights Data Analysis Group (HRDAG), Bing gleaned critical location of death information from unstructured text fields in Arabic using Google Translate and Python Pandas, adding identifiable records to Syrian conflict data. She wrote R scripts and bash Makefiles to create blocks of similar records on killings in the Sri Lankan conflict to reduce the size of search space in the semi-supervised machine learning record linkage (database de-duplication) process.
Our Team: Shreejaya Bharathan, Geoffrey Hung
Goal: Students at Manifold developed a Python library that utilizes machine learning and deep learning to solve for the parameters of dynamical systems defined by differential equations using PyTorch, Docker and MLFlow.
Our Team: Matthew King, Lin Meng
Goal: At Metromile, students created a crash classification model to predict the primary point of impact during a collision using telematics data collected from customers. On another project, they used deep learning to classify images of fraudulent cars.
Our Team: Rushil Sheth
Goal: At the New York Mets, Rushil created infield and outfield shift models using multivariate distributions, powerful classifiers (RF and XGBoost) and clustering.
Our Team: Kamron Afshar, Michael Schulze
Goal: Students at MTC used deep learning to train a Neural Net Image Classifier on images of buildings to classify their use. They generated the data set using Google API. They also built a Selenium crawler data pipeline that scrapes legal codes and collected them in a Redshift database to track changes.
Our Team: Lisa Chua, Shane Buchanan
Goal: At NakedPoppy, students improved the recommendation system for new customers by incorporating content-based and collaborative filtering trained on clickstream data. They used NLP techniques to extract key aspects from Google reviews and implemented feature-based opinion mining on product reviews to assist in the scoring of new products. Later, they conducted market basket analysis on transaction data to provide customers with “pair with” recommendations and increase engagement.
Our Team: Collin Prather
Goal: At the Baltimore Orioles, Collin implemented a Deep Recurrent Survival Analysis model (LSTM in PyTorch) to predict the probability that an American League manager will remove their pitcher using in-game time series data. Another prominent project was developing a model to predict relief pitchers’ level of fatigue, then deploying a containerized (Docker) web application on AWS to host the model and explanatory visualizations to communicate the analysis to key stakeholders in the Orioles front office.
Our Team: Kathy Yi, Sean Sturtevant, Jingwen Yu, Nithish Kumar Bolleddula
Goal: Students at PG&E used SQL, Python and AWS Sagemaker to employ machine learning techniques to predict whether or not a PG&E asset is likely to experience a failure. On another project at PG&E, students built computer vision models on drone imagery to identify defects in power grid lines.
Our Team: Nicholas Parker, Mundy Reimer
Goal: Students at Phylagen worked on projects with data from microbiome samples and laboratory processes that involved software development, data analysis, and machine learning.
Our Team: Qingmengting Wang, Tian (Arthur) Qin
Goal: At PocketGems, students completed two NLP projects using LSTM and Dialogflow.
Our Team: Andrew Eaton, Xuxu Pan
Goal: Students at Propeller Health built a Random Forest model to predict how long it would take to solve a customer support ticket using word embeddings from the ticket texts and a Continuous Bag of Words (CBOW) model. They also published live dashboards with information on ticket counts and complaint rates on a Tableau Server.
Our Team: Yunzheng Zhao, Shishir Kumar
Goal: At Recology, students used linear regression to generate route statistics and service time estimation from GIS and trash collection data. They also analyzed routing data and identified anomalies in the reporting and data-capturing process.
Our Team: Kevin Loftis, Esme Luo
Goal: Students at Reddit worked on graph-based subreddit community detection. They developed a subreddit graph based on user view overlap and performed community detection on the graph to cluster similar subreddits using Python and NetworkX. This doubled the subscription rate of subreddits compared to the existing system. On another project, they worked on a streaming feature extraction pipeline where they architected and developed a Flink streaming data processor in Scala using Docker, Flink, Kafka, Circle CI, and Kubernetes.
Our Team: Meng Lin, Hao Xu
Goal: At Reputation, students used entity matching in deep learning for matching addresses and performed topic modeling to analyze topic trends in reviews.
Our Team: Alaa Abdel Latif, Annette (Zijun) Lin
Goal: Students at the Salk Institute for Biological Studies built super-resolution deep learning models using fast.ai and PyTorch.
Our Team: Sunny Kwong
Goal: At Sparta Science, Sunny worked on improving the reliability of balance tests by performing multiscale entropy analysis with R and Python on force plate scans.
Our Team: Jiaqi Chen, Sakshi Singla
Goal: At Specialty's Cafe & Bakery, Jiaqi performed revenue forecasting employing time series analysis and EDA and also worked on building a recommendation engine using machine learning.
Our Team: Jingxian Li
Goal: Students at the Stanford Graduate School of Business cleaned SEC 10-K documents and built word2vec models based on this corpus. They also came up with different ways to evaluate models and learned to use the BERT model.
Our Team: Lea Genuit, Alan Flint
Goal: At Trulia, Lea employed deep learning techniques using PyTorch to identify scanned documents rotated by multiples of 90 degrees. She also implemented an improvement of the current solution (Tesseract, an OCR engine) by working on a patch of the image using Python. Then, she compared the results of Tesseract and the CNN models. On another project at Trulia, Alan built a power analysis tool in Python for Trulia's A/B testing platform. This entailed coding and deploying an ETL pipeline and designing an interactive application using Streamlit. His second project involved employing an interpretable machine learning model to identify site features that influence positive outcomes for interested home buyers.
Our Team: Dillon Quan
Goal: At TruStar, Dillon used Spark and Scala to build parsers that normalize data ingested into the data lake, centralizing samples into one format for downstream predictive analytics. His second project focused on analyzing URLs and generating scores to determine their level of maliciousness using Python and PyTorch.
Our Team: Qingyi Sun, Akanksha
Goal: Working with the Brain Networks Laboratory at UCSF and the Wicklow AI in Medicine Research Initiative (WAMRI), students focused on characterizing diseases, such as Autism and Alzheimer’s disease, making diagnosis and prognosis from multi-channel brain Magnetoencephalography (MEG) data. They built an LSTM (Long Short-Term Memory) model using PyTorch to analyze brain MEG data and extract information to make predictions on characteristic parameters of interest. On another project, they worked on pretraining 3D Convolutional Neural Networks with brain MRI data. The models were pretrained using a segmentation task.
Our Team: Linqi Sheng
Goal: Working with UCSF and the Wicklow AI in Medicine Research Initiative (WAMRI), Linqi built an LSTM (Long Short-Term Memory) model using PyTorch to analyze brain MEG data, extract information, and make predictions on characteristic parameters of interest.
Our Team: Roja Immanni
Goal: Working with the UCSF Radiation Oncology Department, Roja found that medical image datasets are fundamentally different from natural image datasets in terms of the number of available training observations and the number of classes for the classification task. She hypothesized that compared to architectures used for natural images, those needed for medical imaging can be simpler. She proposed smaller architectures and showed how they perform similarly while significantly saving training time and memory. This is joint work with Gilmer Valdes at UCSF.
Our Team: Zachary Barnes
Goal: Working with UCSF and the Wicklow AI in Medicine Research Initiative (WAMRI), Zachary used UCSF's Spark environment for EHR data to create a data set, generate labels for hospital-acquired sepsis patients, and create prediction models using sklearn and PyTorch.
Our Team: Sihan Chen
Goal: Working with the Morin Lab at UCSF and the Wicklow AI in Medicine Research Initiative (WAMRI), Sihan built a 3D Residual U-net to precisely segment metastases from brain MRI images with PyTorch. He evaluated the effects of number, size, and locations of metastases on the accuracy, which has resulted in a scientific conference presentation and a manuscript and helped UCSF design a state-of-the-art model.
Our Team: Shrikar Thodla
Goal: Working with the Vasant Lab at UCSF and the Wicklow AI in Medicine Research Initiative (WAMRI), Shrikar worked on multiple projects. These included using deep learning to segment and classify medical images, attempting to generate 3D images from multiple 2D image views, leading migration of full-stack components from GCP to IBM, detecting accidental rotations in images using CNNs built in PyTorch, and optimizing code to read images from a database.
Our Team: Srikar Murali, Sean Tey
Goal: Students at United Healthcare cleaned and processed millions of insurance claims transactions with SQL and did hypothesis testing on demographics-related data. On another project, they predicted members who are likely to be hospitalized in the near future as part of a system for identifying administratively complex members with a Gradient Boosting Trees model using the CatBoost library.
Our Team: Andrew Young, Charles Siu
Goal: At Valimail, students tackled the problem of classifying a backlog of 100K+ unknown internet domains generated by Valimail Defend. They developed an end-to-end machine learning pipeline that classifies trusted domains by detecting whether they belong to low-risk categories such as real estate. The Gradient Boosting Machine (GBM) model achieved a 95%+ precision rate with test data when classifying real estate domains using Natural Language Processing (NLP) for web content analysis. On another project, they designed and implemented REST APIs using Flask in Dockerized modules in the pipeline and built web scrapers using BeautifulSoup to gather multiple external data sources for ML model training.
Our Team: Mikio Tada, Stephanie Jung
Goal: Students at Virgo developed a Python script to extract data frames from 120 hours of video. They used Google AutoML to train deep learning models to automate video recording during endoscopic medical procedures and to develop an automatic procedure type tagging system. On another project, they built a prototype object detection tool for real-time polyp tracking during a colonoscopy using CVAT for data labeling and Google AutoML to train the deep learning model.
Our Team: Samarth Inani, Akansha Shrivastava
Goal: At Walmart Labs, students developed an image inpainting tool to remove occlusions from high-resolution furniture images using partial convolutions. They also worked on a research-oriented project to enhance the color detection algorithm to improve the accuracy of the color attribute in the product description of furniture listed on Walmart.com using PyTorch and OpenCV.
Our Team: Max Calehuff, Xintao (Todd) Zhang, Wendeng Hu
Goal: Students working with the Wicklow AI in Medicine Research Initiative (WAMRI) and MedStar Georgetown University Hospital used NLP to create an automated grading program for medical student imaging reports.
Our Team: Andy Cheon, Aakanksha Nallabothula Surya
Goal: At Zyper, students built and deployed an image classification convolutional neural network (CNN) with PyTorch to help brands efficiently recruit fans with desired aesthetic types on social media. They applied feature importance methods using machine learning in Python to identify top factors that drive engagement rates of user-generated content. They also developed a user location prediction pipeline using NLP tools (NLTK, spaCy) to improve upon the existing location predictor and discovered and visualized trends from group chat content from 15 brand communities using mainly Pandas and ggplot.
Our Team: Sankeerti Haniyur
Goal: On this project, the student employed deep learning & NLP techniques to automatically tag cybersecurity documents. She then built a named entity recognition model to detect indicators of compromise in the documents.
Our Team: Darren Thomas, Liying Li
Goal: Students employed NLP techniques in Python for name recognition and used PyTorch and an LSTM to detect fraudulent transactions. On another project, they scraped data using a RESTful API and created an application using Flask in Python. They also applied unsupervised machine learning models to build clustering and anomaly detection models using Python.
Our Team: Benjamin Khuong, Ziqi Pan
Goal: Students worked on an object detection project to detect defects in CT scans of machine parts. Their project was focused on designing computer vision based solutions for automatic defect-detection on industrial devices. They implemented state of the art deep learning algorithms such as Faster R-CNNs, R-FCNs, and 3D convolutional neural networks.
Our Team: Wenkun Xiao, Nicole Kacirek
Goal: Students worked closely with the marketing team to optimize campaign messages by applying NLP and machine learning techniques to competitors’ product reviews and social media posts. They also built a CLTV (customer lifetime value) and revenue prediction model, which was put into production.
Our Team: Brian Chivers, Evan Liu
Goal: Students developed an unsupervised learning algorithm to detect anomalies in AWS network traffic.
Our Team: Rebecca Reilly, Minchen Wang
Goal: Students focused on increasing revenue using topic modeling, employing Python and the spaCy library to discover industry relationships using advertiser behavior. They employed machine learning technologies to predict online ad prices and identify important features. On another project, they created an NLP classifier to correctly identify acceptable and appropriate sentences.
Our Team: Nan Lin, Lance Fernando
Goal: Students built machine learning models to predict the LTV (lifetime value) of customers. On another project, they deduplicated over 5 million venue addresses using fuzzy string similarity metrics and a HMM, then utilized this data to create a search ranking method to recommend venues to event creators.
Our Team: Aditi Sharma, Zhi Li
Goal: Students built a content-based recommendation system for cars and employed auction price prediction.
Our Team: Byron Han, Yuhan Wang
Goal: Students used SQL to extract data from AWS, then employed NLP techniques to build a text classification pipeline.
Our Team: Connor Swanson
Goal: The student built anomaly detection systems in Python for environmental data. He also built time series forecasting models to predict future environmental shifts and built dashboards to host his findings.
Our Team: Tyler Ursuy, Anush Kocharyan
Goal: Students classified each Kiva partner into risk categories by implementing a Random Forest risk detection model that monitors the financial, geographic, and economic information of Kiva’s global partners. They also built an interactive online dashboard to provide easy access to data analyses, data visualizations, and model predictions which will help Kiva reduce the amount of time and money spent on manually inspecting partner information and conducting scheduled in-person visits.
Our Team: Hongdou Li, Zhe Yuan
Goal: Students employed machine learning techniques to predict solar panel performance across the country and provided business inference.
Our Team: Hai Le, Jon-Ross Presta
Goal: Students automated the data generation process for a dashboard with a Python script. They also trained an NLP model which takes the subject line, information about the app that sends the email, and information about the recipient segment to predict email open rates using PyTorch. On another project, the students used Python/PyTorch to build an NLP model to predict user engagement based on message content.
Our Team: Edward Richard Owens, Prakhar Agrawal
Goal: Students created a system that optimizes the operation of HVAC systems by detecting the stabilization of building temperature from sensor data. On another project, they built a golf simulator with the model utilizing a video of a person hitting a golf ball and outputting the ball’s trajectory using machine learning and physics. They employed methods and architectures such as background removal, darknet (YOLO) and optical flow for computer vision.
Our Team: Shivee Singh, Xiao Han
Goal: Students used machine learning and deep learning to identify microplastics in the ocean water using OpenCV Python and PyTorch. Their main focus was to build object detection models trying to locate microfibers from underwater images to approximate the total volume and distribution of microfibers in the ocean.
Our Team: Christopher Olley, Wei Wei
Goal: Students used machine learning and deep learning to identify drivers based on their telematics data (speed and acceleration). On another project, the students extracted events and created features based on this data to train tree based models using Python. They extracted labelled trip data from SQL and Amazon S3 storage and built the ML/DL models to identify users using Python and SQL.
Our Team: Sarah Melancon, Brian Wright
Goal: Students used Python and Spark to combine and aggregate add-on related data from a variety of data sources into a single data source. They also built a dashboard based on this data source using Redash. The students built an ETL pipeline that aggregated several data sources into one combined dataset.
Our Team: Jacques Sham, Quinn Keck
Goal: Students built a data lake on AWS, involving S3 and Redshift, using tools available in the market (Trifacta and Python). On another project, they analyzed Clipper and FasTrak data, tracked key performance indicators, and built dashboards. They developed machine learning and time series models to predict daily Clipper Card usage within 4%.
Our Team: Chong Geng
Our Team: Nina Hua, Donya Fozoonmayeh
Goal: Students employed machine learning for product recommendations and used PySpark to apply a model in a distributed environment. They also implemented machine learning techniques to classify skin color from an image and worked on a recommendation system to improve user experience.
Our Team: Evan Calkins, Jinghui Zhao, Ran Huang
Goal: Students developed an algorithm to support targeted marketing campaigns, which identifies similar mobile users based on their location patterns. They built an n-gram language model for the African language of Wolof to improve functionality of a chatbot using Python. On another project, they calculated relative store location optimality by comparing user movements and travel patterns using a large dataset (4TB) of mobile user information processed on a 9-node Spark cluster.
Our Team: Gokul Krishna Guruswamy, Louise Lai
Goal: Students used PyTorch to train deep learning object detection and classification models to identify faults in equipment and to detect small-scale objects in millions of large drone images. They worked extensively in AWS cloud environment (EC2, S3, lambda, SageMaker, etc.) to productionize these models.
Our Team: Paul Kim, Katja Wittfoth
Goal: Students used deep learning techniques to identify different types of contaminants in waste bins. They also automated identification of contaminants in complex images of waste bins by developing a multi-label image classification model using deep learning, PyTorch, Python, and AWS.
Our Team: Xu Lian, Philip Trinh
Goal: Students built a machine learning model to predict a truck's accident occurrence using Sklearn. They used data analytics and machine learning methods to provide policy recommendations on how Recology can increase safety when collection drivers are out in the city. They also merged sheets from different sources using Pandas and PySpark.
Our Team: Yixin Sun, Julia Amaya Tavares
Goal: Students built a machine learning pipeline on Airflow to estimate subreddit retention ability. They used the Python spaCy package to build a small tool to extract keywords from post comments. On another project, they used TensorFlow to create a multi-label classifier for post titles, and SQL / Pandas for data acquisition and pre-processing.
Our Team: Randy Ma, Xi Yang
Goal: Students developed a review sentiment classifier using a deep learning model with LSTM and Self-Attention to improve reputation assessment (Python, PyTorch). They extracted customer concerns by building a multi-gram keyword extraction tool using syntactic dependency analysis. They also built an automated operational insight reporting tool (SQL, Python) to assess strengths & weaknesses of the client’s user experiences.
Our Team: Crystal Sun, Marwa Oussaifi
Goal: Students created web-based visualization tools for presenting the number of accessible jobs and trip patterns within San Francisco with D3.js. They automated complex data preprocessing and data pipelines to accommodate different scenarios when collecting, processing and piping the data using Python. On another project, they implemented different ML algorithms to predict auto ownership per household.
Our Team: Xinran Zhang, Zitong Zeng
Goal: Students developed a Scala notebook to help the customer service team analyze user-retention metrics such as DAU and Return Retention. They provided an anonymization routine for sensitive impressions and events data using Spark UDF and Murmurhash3. They explored alternatives to traditional parametric tests to improve the performance credibility of A/B test analysis. They also researched and implemented outlier detection methods in Scala.
Our Team: Xinke Sun, Jyoti Prakash Maheswari
Goal: Students used SQL to track KPIs and built a table to store daily metrics using Python. The students applied deep learning techniques to understand the content of real-estate listings consisting of images and text and to predict lead submission.
Our Team: Viviana M. Peña-Márquez, Neha Tevathia
Goal: Students built an NLP model to identify malware names using a CBOW model, leveraging open source data from Twitter. They used PyTorch to build the CBOW model. They created and implemented a pipeline to automatically collect tweets using Twitter’s API, applied machine learning and natural language processing algorithms to detect entities, and fed daily detections to a dashboard.
Our Team: Tian Qi, Jessica Wang
Goal: The students deployed a machine learning pipeline to predict the paid users within the next two weeks using Python and SQL. In another project, the students predicted short-term purchases using Python.
Our Team: Jenny Kong
Goal: The student used machine learning with fMRI data to classify network patterns of concurrently activating brain regions that arise during successful high-fidelity memory retrieval.
Our Team: Miguel Romero Calvo
Goal: The student employed deep learning techniques to improve the performance of Neural Networks in small data. He also conducted research on training and transfer learning methodologies.
Our Team: Anish Dalal, Robert Sandor
Goal: Students employed deep learning techniques in computer vision to accurately segment ventricles in the brain using PyTorch. On another project, they built a text classifier that predicts cancer patient survival from physician notes using Python, PyTorch, Bash, and FastAI.
Our Team: Alan Perry, Tianqi Wang
Goal: Using Python, students employed deep learning techniques to segment different organs, perform dose-volume diagnosis, and transform MRI images into CT images.
Our Team: Max Alfaro, Divya Bhargavi
Goal: Students built deep learning models to classify different views of echocardiograms. They performed exploratory data analysis to become familiar with medical terminology.
Our Team: Victoria Suarez, Harrison Mamin
Goal: Students built a recommender system in Python to match candidates to job postings, which improved recruiters' efficiency by 56%. They researched methods of detecting unconscious gender bias in performance reviews using word embeddings and neural networks. On another project, the students worked on two approaches to extract causal language pairs from text, one using a deterministic rule-based engine and one using a neural network, integrating them into a web-based UI using Flask.
Our Team: Adam Reevesman, Meng-Ting Chang
Goal: Students used Python to build a rule-based algorithm that identifies when a user finished a route but forgot to stop their tracker in the MapMyFitness app. They also performed exploratory data analysis (EDA).
Our Team: Tomohiko Ishihara, Maria Vasilenko
Goal: Students gathered user reviews on Personal Health Record apps on the Apple App Store and Google Play Store and used Latent Dirichlet Allocation to see which app features users talk about most. They built models to predict whether a member is likely to get pregnant by creating a data set, performing feature engineering and building machine learning models. On another project, they collected user reviews from Google Play and the App Store and performed topic modeling (LDA) as implemented in Gensim.
Our Team: Joy Qi, Jialiang Shi
Goal: Students built machine learning classification models to identify lists of legitimate email domains versus fraudulent email domains. They employed machine learning techniques to classify whether an unknown domain is trusted or untrusted. On another project, they created a scraping script to scrape social links on web pages.
Our Team: Yihan Wang, Jian Wang
Goal: Students predicted water utility customer nonpayment with a Random Forest model and implemented the model in Python into Valor’s codebase. They segmented utility customers with K-means clustering to understand their behavior. On another project, they applied multiple time series models to identify malfunctioning water meters. They used SQL and Python to build an end-to-end workflow for the project.
Our Team: Shulun Chen
Goal: The student used SQL, Python, and Swagger to build data pipelines.
Our Team: Ziyu Fan
Goal: The student applied data science and machine learning techniques to forecast E-commerce retailer sales using Python. On another project, she used machine learning and NLP to find anomalies in product matching.
Our Team: Brian Dorsey, Fiorella Tenorio
Goal: Students used Python, TensorFlow, and Time Series demand prediction models. They worked on a model to predict the probability of client purchases and a demand prediction model.
Our Team: Arpita Jena, Devesh Maheshwari, Alexander Howard
Goal: Students employed NLP and deep learning techniques to classify sensitive information in Capital One's internal domain using Python. The result was wrapped in a Flask web app. Another project involved software engineering with the goal of automating Capital One's AWS authentication process.
Our Team: Yiqiang Zhao, Gongting Peng
Goal: Students employed machine learning methods to build a data pipeline for anomaly detection. They also used Python for data exploration.
Our Team: Stephen Hsu
Goal: Students worked within a multidisciplinary team to offer data science services to a nonprofit organization. Specifically, students developed an NLP-based model in Python to classify forum posts so that forum questions could be appropriately matched with professionals who are best positioned to answer them.
Our Team: Timothy Lee
Goal: Students did data pipeline work using the Python API service. Their work involved classifying PDF files with XGBoost in Python and collecting research data samples using Python.
Our Team: Holly Capell
Goal: Students at Eventbrite used machine learning in Python to model ticket sell-through rates in order to help the company identify platform features that drive event sell-out. They performed cohort analyses using Python to help understand the revenue life-cycle of Eventbrite customers and investigated seasonality in ticket sales, using SQL to query data and R to create data visualizations.
Our Team: Bingyi Li, Christopher Csiszar
Goal: Students built a web-based system to classify municipal bonds in order to assure government compliance using Python and Flask. They used big data analytics, machine learning and clustering algorithms to automate the classification of the bank's municipal bond portfolio into High Quality Liquid Asset bonds. This work replaced the need for inefficient and costly external consultants to perform this task quarterly.
Our Team: Yue Lan, Akshay Tiwari
Goal: Students wrote SQL scripts to perform exploratory data analysis and built a data pipeline to ingest airline customer data. They also employed machine learning techniques to build and validate models using Python to predict bookings and cancellations of airline tickets as part of the Flyr airline revenue management system. They also worked on another project that used machine learning techniques to predict customer budget and price sensitivity.
Our Team: Jake Toffler
Goal: Students clustered individual pitchers' pitches by pitch type using level-set trees, a density-based clustering method, in Python.
Our Team: Shikhar Gupta, Fei Liu
Goal: Students used deep learning CNN techniques to identify diseases in chest X-rays.
Our Team: Ting Ting Liu, Jose Antonio Rodilla Xerri
Goal: Students employed machine learning techniques to identify relevant factors that may affect whether or not a Kiva loan will reach full funding. They developed a web application powered by a random forest model in order to predict the success of loans, highlight which factors are driving those loans, and provide suggestions on how to improve them.
Our Team: Vinay Patlolla, Jason Carpenter
Goal: Students worked on two projects with Manifold. In the first project, they used machine learning models such as Logistic Regression, Random Forest and XGBoost in Python to detect faults in oil pipelines. In the second project, they developed a multi-camera multitracking pipeline to track people in a scene using deep learning and clustering techniques.
Our Team: Chenxi Ge
Goal: Students worked on a complex computer vision problem using deep learning with the goal of locating characters to decode the character sequence.
Our Team: Tyler White, Jing Song
Goal: Students used Spark to obtain data to build a public-facing Firefox Health report dashboard. They used time series analysis to predict ESR usage and checked the validity of t-tests with non-parametric tests.
Our Team: Danai Avgerinou, Shannon McNish
Goal: Students worked on a data engineering project to build a small centralized data warehouse to host MTC's data. They also worked on a data science project using NLP with FasTrak survey data and made discoveries involving ridership patterns of Clipper users.
Our Team: Natalie Ha, Christopher Dong
Goal: Students built a text classification model to categorize survey responses and found correlations with NPS. On another project, they built a Tableau dashboard for funnel analysis on reported content in the platform. They also built and deployed (with Airflow) a machine learning model using Spark ML to predict survey text responses and created complex SQL queries to calculate metrics regarding content moderation.
Our Team: Guoqiang Liang
Goal: Students employed machine learning techniques to assign probabilities of churn using Python and Spark. On another project, they used NLP techniques to classify legal documents.
Our Team: Ernest Kim, Davi Alexander Schumacher
Our Team: Dixin Yan, Spencer Stanley
Goal: At Pocket Gems, students employed machine learning techniques to build a churn model and a matchmaking model for a newly developed game. They also researched and developed models to help the marketing team with channel attribution and creatives optimization. On another project, they used time series methods to predict the impact of paid advertising channels on organic install volume.
Our Team: Neerja Doshi, Alvira Swalin
Goal: Students employed machine learning (Python) and deep learning (PyTorch) techniques to build a product recommendation system.
Our Team: Khoury Ibrahim, Danielle Savage
Goal: Students used deep learning techniques to build a multi-label image recognition CNN in PyTorch that identifies contaminants in Recology's images of landfill, recycling, and compost waste.
Our Team: Sara Mahar, Nicha Ruchirawat
Goal: Students automated the real-time detection of a data feed failure from Google, Bing and Facebook sources using a suite of standardized hypothesis tests. On another project, they identified significant clusters of words from tens of thousands of omni-channel reviews with Latent Dirichlet Allocation (LDA) topic modeling and k-means clustering.
Our Team: Kishan Panchal
Goal: Students used machine learning techniques to create a weekly cohort-based churn prediction system for season ticket holders. On another project, they created a data ingestion system to get external ticket data into the team's data warehouse.
Our Team: John Rumpel, Kaya Tollas
Our Team: Mathew Shaw, Cara Qin
Goal: Students employed machine learning techniques to identify suspicious users, predict LTV, and classify game themes.
Our Team: Daniel Grzenda, Jade Yun
Goal: Students employed graph theory to quantify variants and analyze protein data from the blood of patients using Python.
Our Team: Nimesh Sinha, Zizhen Song
Goal: Students used natural language processing and machine learning techniques to build a data pipeline recommendation engine. On another project, they worked on clustering customers based on login data.
Our Team: Ker-Yu Ong, Chen Wang
Goal: Students compared cloud databases (AWS, Google BigQuery, Snowflake and Databricks) by running benchmarking queries for research use cases. They also ran machine learning models to classify WSJ articles and used NLP techniques to extract information from news articles and identify topics in Amazon product reviews.
Our Team: David Kes
Goal: Students used Python to develop an exponentially weighted moving average (EWMA) control charting scheme to detect bus detours for a variety of transit agencies. The algorithm was used to help automate the customer success team's process for detecting defaults in any transit agency's systems.
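For readers unfamiliar with EWMA control charts, the general technique can be sketched as below. This is only an illustration under stated assumptions and not the students' implementation: the function name, the smoothing weight `lam`, the limit width `L`, and the sample readings are all hypothetical, and the chart shown is the standard two-sided textbook form.

```python
from statistics import mean, stdev

def ewma_control_chart(values, baseline, lam=0.2, L=3.0):
    """Return a list of (ewma, lower, upper, out_of_control) tuples.

    baseline: in-control observations used to estimate the mean and sigma.
    lam: smoothing weight in (0, 1]; L: control-limit width in sigmas.
    All names and defaults here are illustrative assumptions.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    z = mu  # the EWMA statistic starts at the in-control mean
    results = []
    for t, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying standard deviation of the EWMA statistic at step t.
        width = L * sigma * (lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))) ** 0.5
        lower, upper = mu - width, mu + width
        results.append((z, lower, upper, not (lower <= z <= upper)))
    return results

# Hypothetical readings: stable at first, then a sustained upward shift.
baseline = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02]
readings = [1.0, 1.02, 0.97, 1.5, 1.6, 1.55]
flags = [out for _, _, _, out in ewma_control_chart(readings, baseline)]
```

Because the statistic averages over recent observations, a sustained shift (such as a bus repeatedly leaving its route) pushes it outside the limits, while a single noisy reading usually does not.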
Our Team: Thy Khue Ly, Beiming Liu
Goal: Students used machine learning to predict default risks of customers and also to cluster them into groups based on their credit card transactions using Python. On another project they used NLP to predict transaction categories, and on a final project they used time-series and machine learning to predict user annual income with transactional data.
Our Team: Feiran Ji, Lingzhi Du
Goal: Students predicted users’ purchasing behavior for future games using machine learning techniques and deployed an end-to-end pipeline to put the model into production on Hadoop clusters using Spark. Additionally, they visualized insights and developed an interactive dashboard to be used in conjunction with the predictive model.
Our Team: Siavash Mortezavi, Kerem Can Turgutlu
Goal: Students used traditional machine learning techniques to predict overall survival of meningioma cancer patients and used deep learning and computer vision to automatically segment brain structures.
Our Team: Sangyu Shen, Qian Li
Goal: Students employed machine learning techniques to classify patients with side effects from radiation therapy using Python.
Our Team: Ryan Campa, Zhengjie Xu
Goal: Students used machine learning to predict stride and cadence to help runners improve their form. They also used unsupervised learning to identify organized race events from millions of rows of workout data.
Our Team: Savannah Logan, Sooraj Mangalath Subrahmannian
Goal: Students applied NLP techniques in Python to identify the main complaints in a website survey. They then employed machine learning techniques to identify areas of possible improvement in coverage rejection time.
Our Team: Taylor Pellerin, Devin Bowers
Goal: Students employed machine learning techniques to help identify fraudulent email sending behavior. They prototyped internal tooling, documentation, and more. Additionally, they built a machine learning classifier to help identify new legitimate email services. This allows Valimail to quickly scan through email aggregate reports to identify legitimate services that email on a customer's behalf.
Our Team: Jingjue Wang, Kunal Kotian
Goal: Students trained a recurrent neural network to forecast water consumption and flagged unusual water meter readings by comparing the deviation of forecasts from true values. They wrote production code for a pipeline to extract and transform data, train deep learning models using TensorFlow, and generate forecasts for several water consumption time series.
Our Team: Nishan Madawanarachchi, Chengcheng Xu
Goal: Students predicted weight loss among customers using linear regression with R. On another project, they used logistic regression in Python to predict the urgency level of clients' messages. They also built a chatbot that aimed to help new users with the onboarding process.
Our Team: Ford Higgins, Ian Pieter Smeenk
Goal: Students contributed to a 'football genome' project for stylistic classification of teams using Python. They built a college basketball statistical model that builds on top of existing models in order to improve them and designed tools for football coaches to use as an aid in scouting opposing teams. These projects were completed using Python, R, SQL and D3.js.
Our Team: Deena Liz John, Patrick Yang
Goal: Students used Python, SQL and Looker to implement A/B testing at Vungle, revolving around the comparison of different ad templates, levels of compression, and more. They also aided in the development of an in-house A/B testing platform.
Our Team: Liz Chen, Yu Tian
Goal: Students developed an end-to-end pipeline in Python using computer vision and deep learning technologies for a company promotional product to recognize online promotions from images. On another project, they deployed REST APIs into production and designed experiments to compare the results from different methods.
Our Team: Vanessa Zheng
Goal: Students developed fraud detection models on a high-dimensional imbalanced dataset using Python. On another project, they devised and evaluated global risk metrics to monitor, condition and strengthen fraud models with SQL & Python.
Our Team: Sri Santhosh Hari
Goal: Students used time series techniques to forecast customer churn. Additionally, they used machine learning techniques like Random Forest and XGBoost to identify key features affecting bookings to predict members' likelihood of booking a car.
Our Team: Arda Aysu, Joshua Amunrud
Goal: Predict complex human activity using mobile device accelerometer and gyroscopic data and detect possible fraud by analyzing impression level data
Joshua and Arda researched several digital signal processing techniques and strategies for clustering time series data. In the end, they used Python to produce a random forest model that used processed data to predict the activities. They also looked at one month of impression level data and tried to identify publishers with unusual characteristics. This is an ongoing project, but tools for further exploration were built in Python.
Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton
Goal: Forecast ridership by gathering and analyzing external factors relevant to ridership demand
At BART, many factors come into play regarding how, when, and where people decide to take public transportation. With recent changes in the transportation industry and growing competition, it is more critical than ever to accurately forecast ridership to plan finances into the future. Cameron and Mikaela used R to develop a SARIMAX time series forecast model incorporating external factors and government data to determine ridership covariates. This modeling algorithm was implemented in a Shiny application to allow a wider audience at BART to take advantage of these forecasts.
Our Team: Nick Levitt, Kyle Kovacevich
Goal: To build predictive models and ETL pipelines for a cyber security project
Nick and Kyle used a combination of advanced machine learning techniques such as Deep Learning, NLP and network analysis to find outliers and patterns in the data. Work was implemented in Python, PostgreSQL and Spark on top of Hadoop and Parquet.
Our Team: Rui Li, Elise Song
At Clorox we were presented with a business challenge of exploring important factors that correlated with short term sales fluctuations after the breakout of an event. We scraped more than 28 million news titles and 102k full articles relevant to the event, conducted sentiment analyses on 14 million Tweets and Instagram posts for feature engineering, and found significant results from regression analysis. Our presentation was well-received by the product team and the data science team.
Our Team: Keyang Zhang
Goal: Extracting key sentences and detecting signals from news
Keyang built a named entity recognition algorithm in Python to extract the main company name from news. He also used Latent Dirichlet Allocation, Word2vec and FuzzyWuzzy to perform keyword and key sentence extraction. Based on the keywords and key sentences, Keyang used Gaussian Naive Bayes to build classifiers to detect the signals from each piece of news.
Our Team: Dominic Vantman, Justin Midiri
Goal: Create a Python program that aggregates sales, consumer, and syndicated data into a unified information database and perform advanced analytics on promotional performance and return on investments
Dominic and Justin designed data visualization dashboards to further understand financial performance metrics, customer profitability reconciliation, and promotional trade-spend optimization tactics that yield highest growth across their robust portfolio of customers, products, and geographical markets.
Our Team: Linda Liu
Goal: Perform exploratory data analysis on different aspects of the financial market and use regression and classification methods to forecast/identify alpha
Linda leveraged ensemble methods, using stacking and majority voting, to detect rare events in the financial market. She also wrote machine learning models that could be deployed into production.
Our Team: Claire Broad
Goal: Identify and rank new or missing words for potential inclusion in the dictionary
Claire developed an autonomous system in Python using scikit-learn to generate a validity score for the items on the monthly list of unmatched queries on the Dictionary site. Her algorithm incorporated signals from lexical structure, query patterns, and usage on social media, and used a novel ensembling technique to mitigate the effect of noise in the training set.
Our Team: Sheri Nguyen, Keyang Zhang
Goal 1: Build an anomaly detection model that found breakages in affiliate reporting
Sheri and Keyang built an anomaly detection system using a combination of the Bayesian Change Point algorithm and the Twitter Anomaly Detection package by integrating R into their Python programs using rpy2. Their model successfully caught some major partner reporting breakages on various platforms.
Goal 2: Build a daily revenue forecasting model
Sheri and Keyang built a model to predict daily revenue. This was an important implementation for Ebates' marketing team: because their daily data funnel from affiliates was often delayed, an accurate revenue prediction model was necessary in order to make important weekly business decisions about where to allocate marketing campaigns. They implemented a few different algorithms, including Linear Regression and Random Forest. Their model was put into production with a 5-6% error rate.
Goal 3: Build a model to find which customers should be sent reminder emails to Refer-A-Friend to Ebates
Sheri and Keyang built a model to pick out the most likely customers to refer Ebates to a friend. The Refer-A-Friend program is one of the highest revenue generating programs offered by Ebates. They first filtered out Ebates customers for fraudulent behavior and then assigned probabilities to each customer. The customers with the highest probabilities were then sent an email to remind them about the Refer-A-Friend program offered by Ebates. Sheri and Keyang's model improved Ebates' original classification model performance from 13% true positive detection to 80% true positive detection. Their model is currently being used in production.
Our Team: Kelsey MacMillan
Goal: Improve site search and recommendation by extracting high-level features from event data
In addition to organizers that Eventbrite has direct relationships with, there are many individual organizers around the world who sign on to Eventbrite by themselves and create events. Sorting through this large inventory of individually organized events to match attendees to their next experience is a challenge. To help address this challenge, Kelsey implemented an unsupervised topic modeling method called Non-Negative Matrix Factorization that extracts key topic "tags" from events using the raw text from their titles and descriptions. For the always-tricky task of validating an unsupervised method, Kelsey built a different type of topic model using a probabilistic approach called Latent Dirichlet Allocation and checked for stability of the found topics across both types of models.
Our Team: Hannah Lieber
Goal: Help the sales and marketing teams better understand characteristics of their revenue generating organizers
Hannah used Hive and Python to develop a random forest model to classify which organizers are likely to churn, allowing Eventbrite to preemptively take action in retaining these organizers.
Our Team: Yige Liu, Anshika Srivastava
Goal: Deposit diversification study and network analysis
Anshika and Yige studied the volatile segments of the deposits to understand whether an increase in segment balance leads to moderation in volatility. They also used graph theory to design a pseudo algorithm to detect households and identify influential people in the bank’s network using the bank’s historical data. They acquired data from multiple servers and databases using SQL (SQL Server) and performed the analysis using Python with pandas and Matplotlib.
Our Team: Graham McAlister, Derek Welborn, Yixin Zhang
Graham built a predictive anomaly detection algorithm by implementing an Isolation Forest in Python. The system takes predictions for search demand of flight characteristics and returns the probability that this point comes from the "normal" distribution in the data set.
Derek built a price prediction system for flights. His solution used gradient boosting implemented with scikit-learn in Python. The model is hosted in a Flask app and is being used by the data science team at Flyr.
Yixin analyzed the tickets that Flyr's customer support deals with to help them better focus their efforts. Her work broke down which partners were most likely to create different issues and how expensive they were to deal with. She visualized this data using both Python's Matplotlib and R Shiny.
Our Team: Su Wang
Goal: Efficiently query, analyze and communicate data to answer business questions
Gyft, a wholly-owned subsidiary of First Data Corporation, is a leading digital gift card platform with a top-rated mobile gift card app for iPhone and Android. With a high-traffic website and massive amounts of data, Gyft needs data analysts to work with cross-functional teams and to efficiently query, analyze and communicate data in a fast-paced Internet business environment. A typical day for a Data Analyst intern at Gyft involves creating and maintaining KPI dashboards, working with different departments to understand business questions, coming up with the right metrics, querying the right data efficiently from the database, extracting insights using various statistical techniques, and conveying the right message to the teams. SQL is used frequently; R and Python are used when statistical analysis is needed.
Our Team: Vyakhya Sachdeva, Evelyn Peng
Goal 1: Identify frequently visited places for users and predict commutes between those places
Vyakhya and Evelyn worked on two major projects at Home.ai. In the first project, they developed a production-ready solution to identify frequently visited places based on mobile/GPS location data using DBSCAN and Gaussian Mixture clustering algorithms. Given the list of places learned in the previous step, they developed an algorithm to identify commutes between these places. They used logistic regression to predict users’ next destination given a departure place and time. Their implementation improved the accuracy of the existing system by 30%.
Goal 2: Predict states of different home devices based on sensor data from IoT devices, time, and environment factors
Their second project was related to home device automation, in which they combined time-series data from IoT devices (motion sensors, electric outlets, door locks, etc.), environmental data (temperature, time of day, etc.), and user location, and built machine learning models using neural networks to anticipate users’ needs in an autonomous home. Their model achieved an overall accuracy rate of ~80%.
Our Team: Eric Lehman
Goal: Develop an automated algorithm to characterize the MLB strike zone for different counts and stances
Eric used both machine learning and analytic approaches to model the actual strike zone given 2015 and 2016 MLB pitch histories. Several different modeling approaches were investigated using Python's scikit-learn package, including LDA, decision trees and gradient boosting. The strike zone was characterized by finding the superellipse that minimized the misclassification rate for a given count and batter stance (L or R). Detailed visualizations of the strike zone were created using R's Shiny package.
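The superellipse fit described above can be illustrated with a small sketch. It is not Eric's code: the parameterization (center, half-widths `a` and `b`, exponent `n`), the exhaustive search over a candidate grid, and the sample pitches are all assumptions made for illustration.

```python
def in_superellipse(x, z, cx, cz, a, b, n):
    """True if the point (x, z) lies inside the superellipse
    |(x - cx) / a|**n + |(z - cz) / b|**n <= 1."""
    return abs((x - cx) / a) ** n + abs((z - cz) / b) ** n <= 1.0

def misclassification_rate(pitches, params):
    """pitches: list of (x, z, called_strike); params: (cx, cz, a, b, n)."""
    errors = sum(
        in_superellipse(x, z, *params) != called for x, z, called in pitches
    )
    return errors / len(pitches)

def best_superellipse(pitches, candidates):
    """Exhaustive search: return the candidate with the fewest errors."""
    return min(candidates, key=lambda p: misclassification_rate(pitches, p))

# Hypothetical pitch locations (feet) with the umpire's call.
pitches = [
    (0.0, 2.5, True), (0.5, 2.0, True), (-0.6, 3.0, True),
    (1.5, 2.5, False), (0.0, 4.5, False), (-1.4, 1.0, False),
]
# Hypothetical candidate grid with a fixed center at (0.0, 2.5).
candidates = [
    (0.0, 2.5, a, b, n)
    for a in (0.5, 0.9, 1.3)
    for b in (0.8, 1.2)
    for n in (2, 4)
]
best = best_superellipse(pitches, candidates)
```

In practice one such fit would be run separately for each count and batter stance, and a numerical optimizer would likely replace the coarse grid used here.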
Our Team: Christine Chu, Erin Chinn
Goal: Predict clinical characteristics of breast cancer patients using mRNA expression levels
Precision medicine is an emerging field which decides medical treatment based on a patient’s genomic content. Given the high dimensionality and complex nature of genes and their interactions, neural networks and deep learning are well suited for predictive models involved in precision medicine. Christine and Erin built a multi-task neural network model in conjunction with a denoising autoencoder to predict clinical characteristics based on the mRNA expression levels of breast cancer patients. They used multiple libraries in Python (NumPy, scikit-learn, Keras, Theano) to develop and test their models. With their model, they were able to achieve 93% and 82% accuracy in predicting two important clinical features that distinguish between different types of breast cancer cells.
Our Team: Matt McClelland
Goal 1: Modeling Los Angeles County deed transactions for accurate forecast
The goal of this project was to recreate a time series model provided to LA County by UCLA economist William Yu. The resulting ensemble model uses VAR techniques and Dynamic Regression with numerous autoregressive terms.
The nature of this project required ggplot for the purposes of validation and ease of forecast projection. The actual statistical implementation required various packages combined with custom functions to achieve ensemble forecast results. Further work needs to be done to bring this function in house as a cleaned working product.
Goal 2: Cleaning, aggregating and visualizing LA County voting data
As an open-ended exploration into LA County Vote By Mail data, there was much work done to both clean and augment the data. Data augmentation was done using the Google geocoding API and R’s gmap package for lat/lon and distance queries. Voter data was also supplemented by Census data, which was merged at the census tract level. Using this data, we were able to generate several EDA style reports for LA County’s in-house reference. The visualization was made using Leaflet and Shiny. Currently this is to be an in-house application for LA County; however, there is potential to make this public facing.
Our Team: Francisco Calderon Rodriguez
Goal: Build a classification model to predict the likelihood of a customer email being a complaint
Francisco used Python to extract JSON-formatted email texts from a customer management platform. He collected over 5,000 non-complaint emails and 260 complaint emails and applied count vectorization from scikit-learn to generate frequencies for each of the 6,000 features. He then modeled the features using logistic regression with an L1 penalty to perform feature selection.
Our Team: Cameron Carlin, Mikaela Hoffman-Stapleton
Goal: Understand customer sentiment and demand to gain insight into future product diversity and customer desires
At Loot Crate, a subscription-based "geek and gaming" memorabilia service, understanding what types of products consumers want and how they feel about the subscription offerings currently available is paramount to success. Cameron and Mikaela used Natural Language Processing (NLP), specifically text analyzers and sentiment analysis, to quantify customer experience and explore trends around historical consumer insights. These results were combined with Naive Bayes Classification and exploratory analysis of historical product offerings to help guide future Loot Crate product curation.
Our Team: Spencer Smith
Goal: Build a web interface for clinics to upload EEG data
Data was parsed and then saved in a MongoDB database. Spencer developed a new algorithm for classifying developmental trajectories of medical data. He has plans to publish this innovation.
Our Team: Connor Ameres, Andre Guimaraes Duarte
Goal: Create an interactive dashboard of Firefox crash statistics
Andre and Connor used Spark (PySpark + SparkSQL) to create an ETL pipeline that generates a tri-weekly (M, W, F) report of crash analysis on a representative 1% sample of the population from Firefox's release channel on desktop. They used Mozilla's MetricsGraphics.js library (D3) in order to produce an interactive visualization of these data.
In addition, Andre performed several ad-hoc analyses of Firefox user behavior data, such as finding a correlation between heavy users and early adopters using logistic regression. Connor also used various clustering methods and anomaly detection algorithms, like Isolation Forests, to segment users based on their corresponding engagement metrics.
Our Team: Joshua Amunrud
Goal: Predict game-level attendance for 2017 season and analyze Stubhub ticket listings over time
Game-level data includes ticket sales, day of week, opponent, promotion type, and more. A linear regression model was fit to this data in order to both predict game-by-game attendance and to quantify the change in factors. This information could be used to cluster the games for the multi-game ticket packs. Another project involved R with ggplot2 to plot ticket listings over time in order to gain an intuition for how prices change relative to gameday.
Our Team: Will Young
Goal: Predict water stress in almond trees using remotely sensed data (e.g. weather, aerial imagery)
Will used established methods in water stress management to engineer features from remote data. These features were used as input to a Gradient Boosting algorithm to predict water stress at the tree level. Other projects included a K-means algorithm to measure tree diameter from aerial imagery. He primarily used Python with NumPy, scikit-learn, and pandas.
Our Team: Melanie Palmer
Goal: Model purchase propensity for season ticket holders as well as third party events to increase sales leads
Melanie used Python to build a gradient boosting classifier to identify purchasers from nearly half a million records. The model was integrated into a dynamic server using Redshift and Crontab to update likelihood-to-purchase scores and predictions on a weekly basis. Data was analyzed and modeled using R and Tableau.
Our Team: Tim Zhou
Goal: Investigate the presence of various economic effects within gaming metrics
Tim wrote internal reports and dashboards of various mobile gaming metrics using R Shiny. He performed various data transformations and cleaning to prepare data for modeling.
Our Team: Lin Chen
Goal: Build a data analysis dashboard and perform feature exploration to optimize models
Lin used SQL and R to analyze phenomena in mobile game in-app purchases, and designed the visualization dashboard using R Shiny. She used Python, R, and QGIS to explore new features from the external data, performed feature engineering and feature selection, and optimized the models. By using new features, the final model achieved a 20% lift on users’ lifetime value prediction.
Our Team: Ruixuan Zhang, Brigit Lawrence-Gomez
Goal: Develop ensemble classification model to classify book chapters and improve user reading experience
In order to ensure maximum reader satisfaction, Scribd is interested in presenting their digital reading content in the optimal way, skipping all the boring stuff at the beginning and end of the book. In order to correctly tag their vast digital library, they rely on the power and efficiency of machine learning. We successfully applied various machine learning, feature extraction, and natural language processing tools to chapter-level book data, achieving 96% accuracy using Python, scikit-learn, and GloVe word vectors.
Our Team: Sheri Nguyen
Goal: Build a model to predict the number of days a package is estimated to be in transit
As a shipping API platform, Shippo aims to make shipping fast, cost-effective and easy to use for its consumers. This project served to give an alternative arrival estimation in addition to the carrier's estimated delivery date. She tested a variety of models including: Random Forest, Linear Regression, Poisson Regression and Gradient Boosting. To evaluate her results, she used K-Fold Cross Validation and error metrics such as MSE and MAE.
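The evaluation approach mentioned above, K-fold cross-validation scored with MSE and MAE, can be sketched as follows. This is a minimal illustration and not Shippo's code: it uses a one-feature linear regression (one of the model families listed), and the feature (shipping distance), the data, and all function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def cross_validate(xs, ys, k=5):
    """Return (mean MSE, mean MAE) over k held-out folds."""
    mses, maes = [], []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        # Score only on the fold the model did not see during fitting.
        errs = [ys[i] - (slope * xs[i] + intercept) for i in held_out]
        mses.append(sum(e * e for e in errs) / len(errs))
        maes.append(sum(abs(e) for e in errs) / len(errs))
    return sum(mses) / k, sum(maes) / k

# Hypothetical data: distance in miles vs. days in transit (noise-free).
distances = [100 * i for i in range(1, 21)]
days = [0.002 * d + 1 for d in distances]
mse, mae = cross_validate(distances, days, k=5)
```

MSE penalizes large misses more heavily than MAE, so reporting both gives a sense of whether errors are uniformly small or dominated by a few badly predicted shipments.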
Our Team: Jinxin Ma
Goal: Use network analysis to determine the most central customers of the bank
Jinxin used R and Python to perform network analysis using Silicon Valley Bank’s CRM data. The analysis helped determine the most central customers, thus providing the bank with information on which customers to further invest in.
Our Team: Juan Pablo Oberhauser
Goal: Build a data acquisition pipeline for FASTA RNA sequencing files
The pipeline used tools such as pseudo-alignment and PySpark to download, reshape, and quantify genomic data. Juan Pablo built a tree-based ensemble classification program in Spark to diagnose several diseases and viral infections. He primarily used Spark, Python and AWS.
Our Team: Arda Aysu
Goal: Build a model to predict student scores on external state-mandated exams using internal metrics
After gaining familiarity with the structure of Summit's student performance data, a model was built in Python to predict their test scores. Time was also spent giving insight on data-driven approaches and helping the Summit team learn Python.
Our Team: Roger Wu
Goal: Quantify the impact of pricing and other factors on conversion
Turo is a car rental marketplace where car owners can rent out their cars to travelers. Understanding how price and other factors impact conversion is important for the success of the business. Using R and SQL, Roger quantified the impact of these factors on conversion. This was done using an observational study and a logistic regression model.
Our Team: Yichao Zhu
Goal: Explore the seasonality and regional outbreak of conjunctivitis based on social media posts
Yichao extracted tweets that mentioned conjunctivitis, along with related fields such as replies, location, and time, using Python, AWS, and the Twitter API. She implemented social-network-based methods to estimate users' locations, which were not provided. She also suggested NLP (a content-based method) for location estimation, and she created a database to store the datasets for further use.
Our Team: Vincent Rideout
Goal: Use Deep Learning to study the radiation therapy pretreatment process
Vincent used convolutional neural networks to predict pretreatment passing rates from images representing the radiation dosage applied to each part of the body (fluence maps). He experimented with many different neural net architectures and hyperparameters to optimize the quality of the predictions. Transfer learning was used to improve performance on a small dataset: a model trained to excel at image recognition on the ImageNet real-world image dataset was repurposed for this problem and turned out to be the best solution. The team is about to submit an academic paper based on their findings and will be presenting at the American Association of Physicists in Medicine Annual Meeting & Exhibition.
Our Team: Tim Zhou, Zefeng Zhang
Goal: Leveraging utility companies' water usage data to optimize operational effectiveness and revenue sources
Tim developed a hybrid clustering algorithm using KMeans and DBSCAN to help flag potentially anomalous water meters. Tim explored using Fourier analysis to identify periodic behavior, but eventually settled on a simple autocorrelation algorithm. Zefeng investigated correlations between gas and water usage data. Zefeng developed an anomaly detection model using lag 1 Markov Chains.
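To show how a simple autocorrelation approach can reveal periodic behavior, the sketch below computes the sample autocorrelation at a range of lags and picks the lag with the highest value. This is an assumed, minimal version of the idea, not Tim's implementation, and the meter readings are invented.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0 or not 0 < lag < n:
        return 0.0
    num = sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag))
    return num / denom


def dominant_period(series, max_lag):
    """Return the lag in 1..max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Hypothetical meter readings that repeat every 7 steps.
weekly_pattern = [5, 6, 6, 7, 6, 9, 10]
readings = weekly_pattern * 8
period = dominant_period(readings, max_lag=10)
```

A meter whose readings show no clear peak at the expected period, or a peak at an unexpected lag, could then be flagged for closer inspection.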
Our Team: Lawrence Barrett
Goal 1: Identify outliers in self-recorded weight data to clean datasets
Lawrence used R and a distance-based outlier detection algorithm to identify outliers in weight data. The algorithm came from an academic paper, but it was modified to work on the weight data that Vida provided.
Goal 2: Evaluate performance of a new coach matching algorithm via A/B testing
Lawrence used A/B testing in R to confirm that the new coach matching algorithm significantly improved communication between coaches and users. He also evaluated the power of the test to determine the reliability of these results.
Goal 3: Pulling data from Vida's database for company reports
Lawrence used SQL in BigQuery, Google's relational database, to pull data for reports that helped determine the effectiveness of Vida's weight loss, diabetes prevention, and blood pressure monitoring programs. This was an ongoing need throughout most of the internship since reports needed to be updated and turned in on a monthly basis. Lawrence used a combination of SQL and R to query and clean the data for these reports.
Goal 4: Testing the Vida ChatBot
Lawrence implemented many tests on functions in production that are used to understand user text input by grabbing the important words needed to respond to the user's queries. He used Python's unit testing framework to test these functions.
Our Team: Donny Chen
Goal 1: Develop recommendation systems to personalize users’ healthcare reading
Donny performed NLP topic modeling and applied information retrieval to large documents and users’ messages to engineer features. He developed recommendation systems, including collaborative filtering, in Python to personalize healthcare readings in response to a keyword search.
Goal 2: Data pull from Google BigQuery
Donny used SQL to retrieve and aggregate data from Google BigQuery, covering everything from KPIs to user metrics. He also used R to cleanse and visualize the data for various analytical reports.
Goal 3: Design and migrate from Google BigQuery RDBMS to Neo4j graph database
Donny contributed to designing the schema and importing data for healthcare readings into a Neo4j graph database to help transition from a traditional relational database to an advanced and scalable graph database. Retrieving connected information in a relational database is often cumbersome because it requires many table joins, whereas a graph database can retrieve it much more efficiently.
Goal 4: Engineer and deploy web server event tracking of user changes in Django
Donny used Django to automate event logging to keep track of changes in users’ information. He worked in Python and contributed to the web server in production.
Our Team: Shivakanth Thudi, Danny Suh, Matthew Wang, Jennifer Zhu
Goal: Improve the revenue share model
Vungle’s goal is to evaluate the feasibility and profitability impact of engaging in a revenue share business model with advertisers. Matthew and Jennifer constructed a data extraction and processing pipeline using Python and Spark and built an interactive dashboard as well as a machine learning model to enable the sales team to explore, visualize, and predict profitable segments of the market. Danny and Shiva improved Vungle’s lifetime value (LTV) models through the use of libraries such as XGBoost, scikit-learn, and Spark ML.
Our Team: Albert Ma
Goal: Implement recommendation system using collaborative filtering on users and wikis
Wikia shows recommended wikis at the bottom of its pages, selected from a custom list. Albert wanted to generate recommendations using collaborative filtering instead of manually picking which wikis are shown to users. He implemented a recommendation system using alternating least squares and matrix factorization in Python with the NumPy and pandas packages. To evaluate model performance, he generated baseline models based on random guessing and on recommending popular wikis. He was able to reduce the miss rate by 25% and 20% compared to random guessing and popular wikis, respectively.
Our Team: Maxine Qian, Zainab Danish
Goal 1: Build machine learning models to predict prospects of Open Kitchen products
Goal 2: Incorporate a new variable ‘Clumpiness’ and gauge whether it is a valuable addition to the RFM framework through modeling
Maxine and Zainab worked with the analytics team to generate consumer insights for a new brand and built models to predict the likelihood of a user purchasing the new brand. They extracted features using SQL, conducted exploratory data analysis in R, and performed feature engineering and built classification models using Python. The model was used to decide which customers to target for the new brand. They also used machine learning methods to gauge whether the ‘Clumpiness’ variable is a valuable addition to the RFM framework.
Our Team: Valentin Vrzheshch
Goal: Build a tool to benchmark the company’s trading performance
Valentin wrote Python scripts that compute indicators of high-frequency trading costs for each order (metrics such as implementation shortfall, toxicity, and market impact) using pandas and psql. Valentin also developed dashboards with visualizations of the indicators using R (shiny, plotly, googleVis) for easy and fast assessment of the algorithm’s performance.
Our Team: April Liu
Goal 1: Build a fraud detection model
April evaluated lasso logistic regression, random forest, AdaBoost/gradient boosting and built a final model that improved baseline F0.5 score by 35%. She evaluated various performance metrics based on the company’s business model and performed correlation analyses and data transformations.
Goal 2: Pipeline for feature importance analysis
April designed and built different Cassandra databases to house 100 GB of data. She imputed missing signals in data used to train risk profile models and implemented a hashing method to impute signals on a rolling basis. She also assessed feature importances via a random forest classifier and extracted over 100 signals from over 200 GB of data to identify fraud using Python.
Goal 3: Use one set of signals to replace existing expedite rule to achieve higher performance in filtering out high risk transactions while maintaining high proportion of the expedited transactions
April performed advanced analytics and exploratory data analysis on two different payment source signals to come up with a solution that improves the current payment expedite rule.
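For readers unfamiliar with the F0.5 score mentioned under Goal 1, it is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it from confusion-matrix counts using the standard formula; the counts are made up and are not from April's model.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=0.5):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: precision = 0.8, recall = 0.5.
f_half = f_beta(80, 20, 80, beta=0.5)
f_one = f_beta(80, 20, 80, beta=1.0)
```

With precision higher than recall, as in these example counts, `f_half` comes out above `f_one`, reflecting the extra weight placed on avoiding false alarms.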
Our Team: Alice Zhao
Goal: Build fraud detection models
Alice worked on building fraud detection models for Non-Sufficient Fund fraud, which is a difficult problem because the dataset is highly imbalanced. Alice tried different sampling methods, cost-sensitive algorithms, and hashing tricks to build a good model using Python, R, and SQL.
Our Team: Jinxin Ma, Alice Zhao
Goal: Compare feature importance measures and build a better fraud detection model
Jinxin and Alice worked under the Risk Data team at Xoom. They compared feature importance measures from random forest, gradient boosting, and extra trees, and predicted whether a transaction was fraudulent with a logistic regression model, using Python, R, and Tableau.
Our Team: Jinxin Ma
Goal: Impute missing values for transactions and re-evaluate fraud detection rules
Jinxin created his own database using PostgreSQL and wrote efficient queries to impute missing values for millions of transactions. The imputation improved the performance of fraud detection rules.
Our Team: Ben Miroglio and Chhavi Choudhry
Goal: cluster web sessions to segment users and improve the flow of Airbnb's website and mobile app
Ben and Chhavi employed machine learning techniques to identify features indicative of positive outcomes using R and Python. They built an interactive web session visualizer using D3.js to identify key differences among different segments of users and to identify bottlenecks in the session journey.
Our Team: Paul Thompson and Jacob Pollard
Goal: cluster users based on factors influencing event preferences and classify event category based on the event description
Paul and Jake applied the ROCK hierarchical clustering algorithm to event data in Python and clustered users based on the events they attended. They also implemented a Naive Bayes classifier using only base Python data structures, employing 5-fold cross validation, resulting in 75% mean accuracy.
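To give a sense of what a Naive Bayes text classifier built from base Python data structures can look like, here is a minimal sketch. It is not Paul and Jake's implementation: it assumes a multinomial model over whitespace tokens with add-one smoothing, and the event descriptions and categories are invented. Cross-validation is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical event descriptions and categories.
train_docs = [
    "live jazz concert downtown",
    "rock band concert tonight",
    "startup pitch networking night",
    "tech founders networking mixer",
]
train_labels = ["music", "music", "business", "business"]
model = NaiveBayesText().fit(train_docs, train_labels)
predicted = model.predict("jazz band concert")
```

Working in log space avoids numerical underflow when descriptions contain many tokens, and add-one smoothing keeps unseen words from zeroing out a class.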
Our Team: Vincent Pham and Brynne Lycette
Goal: employ machine learning techniques for credit card fraud detection and build a data unification platform
Capital One's fraud team has collected and built more than two hundred features relevant to classifying fraudulent credit card transactions. Vincent and Bree employed various machine learning techniques using H2O and Dato in order to evaluate software robustness and increase accuracy of fraud prediction. They also implemented a NoSQL data store and a higher level in-memory storage system to unify various streaming and batch processes.
Our Team: Ghizlaine Bennani and Mrunmayee Bhagwat
Goal: cluster similar YouTube channels
Ghizlaine and Mrun developed an algorithm using supervised and unsupervised machine learning techniques using Python and PostgreSQL to cluster similar YouTube channels and videos based on performance and content metrics. This clustering resulted in personalized targeting for multi-channel YouTube content providers. They also modeled median views for individual YouTube videos in their first week using regression analysis.
Our Team: Isabelle Litton
Goal: extract sites of cancer recurrence in clinical notes
Isabelle leveraged natural language processing using Linguamatics to capture cancer recurrence sites from clinical and radiology notes. She also automated the results-validation process with Python, saving approximately two hours per validation.
Our Team: Tate Campbell and Sharon Wang
Goal: model consumer churn for a product loyalty program that identifies key features contributing to customer retention rates
Tate and Sharon used PySpark to extract relevant data and perform feature engineering on more than 10 GB of data. A random forest classification model was built using Python to predict the amount of time consumers would stay actively enrolled in the program.
Our Team: Miao Lu
Goal: help Dictionary.com understand super-user behavior
Our Team: Meg Ellis and Jack Norman
Goal: create a price-suggestion model to assist event organizers in optimizing ticket sales and revenue
Identifying important features that most influence ticket prices, Meg and Jack implemented a K Nearest Neighbors model that clusters events with similar characteristics, and subsequently leveraged the distribution of costs of these similar, successful events to suggest an appropriate range of ticket prices that the organizer can use when creating their event. Flask was subsequently used to create a web application to allow users to interact with the model.
Our Team: Piyush Bhargava, Felipe Chamma and Harry O'Reilly
Goal: optimize liquidity allocation
Piyush, Felipe, and Harry used SQL Server to develop a new process to optimize cash allocation for the liquidity buffer by leveraging client transaction data and end-of-day positions. The process was automated and now runs on a daily basis to improve performance. A SQL-based tool was also designed to detect unusual customer transaction patterns and alert bank representatives to mitigate liquidity risks. The loan-level loss allocation process was also automated to support capital stress testing required by new financial regulations.
Our Team: Matthew Leach
Goal: investigate customers who make claims on FLYR's FareKeep product, which allows customers to lock in a flight price for up to 7 days
Matthew investigated how to better identify which customers were more likely to make a FareKeep claim using clustering techniques to group customer segments, and employing logistic regressions to predict claim rate. He additionally used Bokeh to create a dashboard displaying factors which influence the claim rate.
Our Team: Ghizlaine Bennani
Goal: build a dashboard to visualize JUVO data in an interactive fashion
Ghizlaine built a Flask app in Python that renders an interactive D3 visualization dashboard synchronized with real-time data from Redshift. The purpose of the dashboard is to help customers understand JUVO's business and to visualize data that supports business units in their decision-making process.
Our Team: David Wen, Jaime Pastor
Goal: create a dashboard to analyze user retention and build customer credit profiles
Our Team: Tate Campbell, Sharon Wang
Goal: build a machine learning model to predict inquiry frequency to optimize bid prices for Google AdWords
Tate and Sharon used k-means clustering to cluster hours of the day that have similar numbers of impressions and clicks, then employed a random forest to quantify the relationship between cost and inquiries. They also built an optimization algorithm to search for the best combination of daily bid prices based on the predicted number of inquiries and budget constraints.
Our Team: Gabrielle Corbett and Jason Helgren
Goal: validate the utility of an app-based street sweeping alert
Metromile provides an app with convenience features including fuel economy analysis, engine monitoring, and street sweeping alerts. Gabby and Jason developed a logistic regression model using Python and PostgreSQL to predict driver behavior based on past driving habits and whether the driver received a street sweeping alert.
Our Team: Mrunmayee H. Bhagwat
Goal: build a time series forecasting model
Using NLP techniques, Mrunmayee categorized Walmart’s unstructured social media data and modeled its social buzz using a generalized linear model. She also performed time series analysis on the data to identify seasonal fluctuations and trends in Walmart’s quarterly revenue and built a revenue forecast model using a SARIMA approach in R.
Our Team: Kirk Hunter
Goal: improve mobile targeting and optimize bid pricing in the programmatic advertising space
Programmatic advertising takes place on desktop and mobile devices and the space can yield billions of data points per day. Kirk interacted with extremely large datasets using the distributed computing framework Apache Hive. He built machine learning models using Python’s scikit-learn library that identified mobile users who are most likely to lead to a successful outcome for the company.
Our Team: Alex Morris
Goal: develop a structured method by which to extract data from SEC EDGAR filings
Alex assisted in the development of the SecParser Python package and implemented a data pipeline that extracts key data for analysis from unstructured SEC EDGAR form filings (Form 4). He subsequently applied various machine learning and regression techniques to the parsed data to identify which filings have a significant impact on the issuer's stock performance following large insider transactions.
Our Team: Jaclyn Nguyen
Goal: identify achievement gaps and calibrate grading system
Jaclyn used national MAP assessments and statistical analysis to confirm that students met their projected assessment growth independent of race and socioeconomic group. She additionally conducted an analysis on whether teachers graded consistently across grade levels and subjects, and provided a regression-based recalibration technique.
Our Team: Alex Romriell and Swetha Venkata Reddy
Goal: create a model to detect conjunctivitis outbreaks
Alex and Swetha performed text and geospatial analysis on more than 300,000 tweets to detect local pinkeye outbreaks. They created a framework for identifying tweets directly related to conjunctivitis. Surges in outbreaks were mapped to clinical records nationwide. Time series analysis of the tweets revealed trends and seasonality similar to those in the actual hospital data. Text analysis techniques such as latent semantic analysis, run on AWS, were employed to filter noise from the data. A multinomial Naive Bayes model was also developed based on TF-IDF scores of the tweets to predict sentiment.
Our Team: Jacob Pollard
Goal: identify potential donors
Using random forest models in R, Jacob selected the 10 variables, out of 70 in the USF alumni donor database, that had the strongest influence on predicting a potential donor. These variables were employed in building an ensemble of logistic regression classifiers with bagging. This method was subsequently implemented in Python with the help of scikit-learn, NumPy, and pandas.
Our Team: Paul Thompson
Goal: improve predictions of freelancer job performance ratings
Paul created classification models using LSTM and GRU neural networks, Word2Vec and Doc2Vec, and TF-IDF in conjunction with machine learning algorithms such as random forest and support vector machines. He used Python's scikit-learn, gensim, and Keras libraries and ran models both locally and on AWS EC2 clusters (using CloudML) and EC2 GPU (using ssh), succeeding in improving upon the existing production model.
Our Team: Ryan Speed
Goal: develop a framework to support processing and analysis of NBA player tracking data
Ryan built a MapReduce framework in Python to support the creation of descriptive and predictive models and to generate a set of analytics outputs for SportVu NBA player tracking data, which generates 800,000 data points per game. Clustering and similarity scoring were conducted on player location distributions, court spacing, and distance traveled across multiple 55 GB datasets.
Our Team: Alex Romriell and Swetha Venkata Reddy
Goal: determine the efficacy of specialized Oculus Rift eye treatment software
Alex and Swetha optimized the ETL (extract, transform, load) processes using Python and PostgreSQL to better ingest user game-play data. SQL was used to query the dataset and retrieve key eye metrics from the game logs. They confirmed that the treatment correctly targeted the weak eye. They also created a D3.js dashboard to visualize patterns and behaviors in users’ gaming sessions.
Our Team: Chhavi Choudhury, Yikai Wang and Wanyan Xie
Goal: develop a user conversion rate model and a user lifetime value model for a mobile app advertising company
Vungle built an ad recommendation system based on a user conversion rate prediction model with ad view-install data. Wanyan implemented a factorization machine model to quickly calculate the weights of interaction terms in a logistic regression model. Yikai ran a test-performance-based feature selection algorithm to select features in the current model and also implemented a gradient boosting tree model. Using Python and Spark, they were able to improve both the efficiency and accuracy of predictions. To build user lifetime value (LTV) models, Chhavi, Yikai, and Wanyan identified relevant user-related features and developed various models to predict active user days and 7-day revenue across different advertisers.
Our Team: Isabelle Litton
Goal: automate content tagging process to improve ad targeting
Isabelle trained multiple classifiers using Python’s scikit-learn package and TF-IDF features to achieve an overall accuracy of 86%.
Our Team: Jaclyn Nguyen, Sakshi Bhargava, and Henry Tom
Goal: develop an automatic image selection and image tagging algorithm
Williams-Sonoma’s product feed contains a tremendous amount of image data, and it is difficult to automatically tag images for both analysis and product recommendations. Jaclyn, Sakshi and Henry successfully automated this process through the use of custom built ensemble image processing algorithms, superpixels, and other recent imaging advances, achieving an accuracy of 99% between silhouette/product images. They also developed a color prediction and color labeling algorithm with 90% accuracy.
Our Team: Erica Lee and Binjie Lai
Goal: improve and test dynamic pricing strategies
Wiser provides pricing strategies for e-commerce retailers to optimize revenue. Binjie applied tuned ridge regression and time series models for prediction, designed experiments, and tested results, improving upon the current prediction model by 15%. Erica employed a linear mixed effects model to measure the effectiveness of the dynamic pricing engine, using technologies that included Python, Spark, and PostgreSQL, and worked on the big data platforms Amazon Redshift and Databricks.
Our Team: Felipe Formenti Ferreira
Goal: enhance Womply's Statboard metrics, evaluate customer churn, and identify high-value clients
Using clients' daily revenue data, Felipe used Python to parse the transaction history and provide business insights such as growth rate and average transaction size. Felipe also used principal component analysis to eliminate correlation between predictor variables and identify influential factors contributing to customer retention.
Our Team: Brian Kui and Tunc Yilmaz
Goal: implement generalized linear models and neural network models to improve existing load forecasting models
AutoGrid helps industrial customers shed power by controlling the operation of power-consuming devices such as water heaters. The team evaluated modifications to the forecasting models proposed by the data science team in order to help AutoGrid decide whether it is feasible to incorporate the modifications into the production code. They analyzed signals received, load, and the state of the water heaters, and identified errors in operation.
Our Team: Cody Wild
Goal: provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for the product's subscribers
Cody utilized clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that utilizes patterns of word reoccurrence to identify nearest neighbors in content space.
Our Team: Kailey Hoo, Griffin Okamoto, and Ken Simonds
Goal: mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R
The team quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.
Our Team: Matt Shadish
Goal: apply machine learning techniques to improve an existing trading strategy
Matt used Python and pandas to incorporate external variables and apply cross-sectional models to a time series problem. He also created visualizations of current trading strategy performance using ggplot2 in R.
Our Team: Brian Kui and Tunc Yilmaz
Goal: query time-series printer data that is highly unbalanced: fewer than 200 faults within two million time records
Brian and Tunc applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.
Our Team: Alice Benziger
Goal: create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data, such as page views (on mobile and desktop applications), email click-through rates, and social media (Facebook, Instagram, and Twitter) interactions
Alice applied machine learning techniques to implement a model to predict the popularity score of new words to optimize user engagement.
Our Team: Matt Shadish
Goal: perform analysis of historical retail product prices across stores
Using Python, Matt created visualizations of these analyses in Matplotlib. He then implemented the analysis in a functional style (using RDDs and DataFrames) to take advantage of Apache Spark. This enabled an analysis of billions of price history records in a reasonable amount of time.
Our Team: Steven Chu
Goal: define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics
As Fandor is a subscription-based model, their focus is to bring in more subscribers and retain current subscribers. There is a lot of potential to use metrics to segment as well as run predictions for users. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to see and utilize in their own decision-making processes.
Our Team: Florian Burgos and Dan Loman
Goal: use machine learning to predict the price of connecting flights based on the price of the one-way tickets
Florian and Dan improved user engagement on the website by displaying content on the landing page with D3. They also computed content overnight using distributed computing on an AWS EC2 instance to find the best deals in the U.S. by origin.
Our Team: Chandrashekar Konda
Goal: solve parts normalization and payment terms mapping tasks
Using Hadoop and Elasticsearch, Chandrashekar identified similar mechanical parts out of five million parts in oil rig design versions for the GE Oil & Gas business.
In a separate project using Python and Talend, Chandrashekar identified the best payment terms from one million payment terms across GE’s different businesses.
Sourcing: Using Python, Chandrashekar compared over 1.8 million purchase transactions with 50,000 of GE’s products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries.
Our Team: Sandeep Vanga
Goal: perform unsupervised text clustering to gain insights into representative sub-topics
Sandeep built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep learning-based features) models. The first method is based on aggregation of word vectors, and the second is based on Bag of Clusters (BoClu) of words. He also implemented the elbow method to choose the optimal number of clusters. These algorithms were validated on 10 different brands/topics using news data collected over one year. Various quantitative metrics, such as entropy and silhouette score, and visualization techniques were used to validate the algorithms.
Our Team: Brendan Herger
Goal: study the existing data stream to drive business decisions, and optimize the data extract-transform-load process to enable future insightful real-time data analysis
Though Lawfty’s existing pipeline had substantial outage periods and largely unvalidated data, Brendan was able to support creating a new Spanish-language vertical, creating near-real-time facilities, and contributing to better targeting of AdWords campaigns.
Our Team: Fletcher Stump Smith
Goal: perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (Scipy).
Fletcher wrote code to store and process text data, using Python and SQLite. He performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.
Our Team: Michaela Hull
Goal: find duplicate voters using exact and fuzzy matching, feature engineering such as distances between two points of interest, trolling the Census Bureau website for potentially useful demographic features, and classification models, all in the name of poll worker prediction
Michaela used distributed computing, the Google Maps API, relational databases, techniques for handling large databases (~5 million observations), and a variety of machine learning techniques.
Our Team: Layla Martin and Patrick Howell
Goal: develop a machine learning model to predict a flavor label for every food in MyFitnessPal’s database
Using primarily Python and SQL, the team built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within UnderArmour.
Our Team: Leighton Dong
Goal: build consumer credit default risk models to support clients in managing investment portfolios
Leighton prototyped a methodology to measure default risk using survival analysis and a Cox proportional hazards model. He developed an automated process to comprehensively collect company information using the Crunchbase API and store it in a NoSQL database. Leighton also engineered datasets to discover potential clients for analytics products (such as retail pricing optimization) and automatically collected company names and other text features from Bing search result pages.
Our Team: Brendan Herger
Goal: build out multiple data pipelines and utilize machine learning to help drive REVUP's beta product
Brendan was able to create three new data streams which were directly put into production. Furthermore, he utilized natural language processing and machine learning to validate and parse Mechanical Turk output. Finally, he utilized spectral clustering to identify individuals’ political affiliations from Federal Election Commission data.
Our Team: Rashmi Laddha
Goal: build a predictive model for revenue forecasting based on stylist’s cohort behavior
Rashmi identified stylist micro-segments by analyzing stylists’ behavior in their initial days with the company, using k-means clustering on three parameters. She then built a forecast model for each micro-segment in R using Holt-Winters filtering and ARIMA, tuning the model to get an error rate within 5%. She also performed sensitivity analyses around changing early performance drivers in a stylist’s life cycle.
Our Team: Griffin Okamoto and Scott Kellert
Goal: demonstrate the efficacy of online content assessments.
Griffin and Scott demonstrated the efficacy of Summit's online content assessments by using student scores on the assessments and demographic information to predict standardized test scores. They developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.
Our Team: David Reilly
Goal: examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R
David constructed behavioral and situational features to model driver responses to dispatch requests using advanced machine learning algorithms. Using Python, he also analyzed cancellation fee refund rates across multiple cities to predict when a cancellation fee should be applied.
Our Team: Layla Martin and Leighton Dong
Goal: analyze influential factors in USF undergraduate student retention using logistic regression models
The team predicted students' decisions to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.
Our Team: Sandeep Vanga and Rachan Bassi
Goal: automate the process of image tagging by employing image processing as well as machine learning tools
Williams-Sonoma’s product feed contains more than a million images, and the corresponding metadata (such as color, pattern, and type of image: catalog, multiproduct, or single-product) is extremely important for optimizing search and product recommendations. Sandeep and Rachan automated the process of image tagging by employing image processing as well as machine learning tools. They used image saliency and color histogram-based computer vision techniques to segment and identify important regions and features of an image. A decision tree-based machine learning algorithm was used to classify the images. They achieved 90% accuracy on silhouette/single-product images and 70% accuracy on complex multiproduct/catalog images.
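As a rough illustration of what a color histogram feature looks like, the sketch below quantizes RGB pixels into coarse bins and compares two images by histogram intersection. It is not the team's pipeline: the pixel values, the choice of 4 bins per channel, and the use of histogram intersection as a similarity measure are all assumptions made for the example.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantized (r, g, b) bin and return normalized frequencies.

    bins_per_channel is assumed to divide 256 evenly.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of the per-bin minimum of two normalized histograms."""
    return sum(min(v, h2.get(bin_, 0.0)) for bin_, v in h1.items())


# Hypothetical 4-pixel "images": a product on a white background.
mostly_red = [(250, 10, 10)] * 3 + [(255, 255, 255)]
mostly_blue = [(10, 10, 250)] * 3 + [(255, 255, 255)]

red_hist = color_histogram(mostly_red)
blue_hist = color_histogram(mostly_blue)
similarity = histogram_intersection(red_hist, blue_hist)
print(red_hist, similarity)
```

A vector of such bin frequencies is one kind of feature that could be fed to a classifier such as a decision tree.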
Our Team: Luba Gloukhova
Goal: quantify the performance of an underlying high frequency trading strategy
Luba expanded an existing internal database with data from the Bloomberg Terminal, enabling a deeper understanding of the symbol characteristics underlying strategy performance. She also identified discrepancies in an end-of-day trading analysis database.
Our Team: Daniel Kuo
Goal: develop a supervised machine learning algorithm for a Publication Authorship Linkage project to determine whether multiple publications refer to the same authors
Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily add new attributes and features to the models. The modeling techniques used in this project include logistic regression, decision trees, and AdaBoost. The team used the first two algorithms to perform feature selection and then used AdaBoost to improve performance.
Our Team: Monica Meyer and Jeff Baker
Goal: Develop a classification algorithm/model for the Disease Area Relevancy project that would predict and score how related a given document was to a specified disease area.
The model provides Zephyr the ability to quickly score and collect documents, as they relate to a disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The approach is scalable to include other document types.
Our Team: WeiWei Zhang
Goal: Determine disease area relevancy for medical journals using machine learning techniques.
The project began with data sampling from the PubMed database. Through natural language processing and feature engineering, the abstract and title text of medical documents was transformed into tokens with TF-IDF (term frequency-inverse document frequency) scores. By leveraging the characteristics of a random forest classifier, the most important features from the feature space were selected. The body of the model was a multi-label logistic regression. The results were evaluated based on accuracy, recall, precision, and F1 score. In short, the project is a great example of handling unlabeled data, imbalanced classes, and multi-label problems in a machine learning context.
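The TF-IDF step mentioned above can be sketched in a few lines. This is a minimal illustration using the unsmoothed definition tf × log(N / df) and whitespace tokenization, not the project's actual code; the three sample abstracts are invented, and library implementations often use smoothed variants that give slightly different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: score} dict per document using tf * log(N / df).

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            token: (count / len(tokens)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return scores


# Hypothetical abstracts, invented for illustration.
abstracts = [
    "insulin therapy in diabetes patients",
    "tumor growth in cancer patients",
    "insulin resistance and diabetes risk",
]
scores = tf_idf(abstracts)
print(scores[0])
```

Tokens that appear in only one abstract (such as "therapy") score higher than tokens shared across abstracts (such as "insulin"), which is what makes the scores useful as input features for a classifier.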