Sandeep Vanga ('15) worked with Google to perform unsupervised text clustering on news articles about brands and topics of interest to Google, with the aim of identifying the most representative subtopics. He built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep-learning-based features) models: the first aggregates word vectors, and the second uses a Bag of Clusters (BoClu) of words. He implemented the elbow method to choose the number of clusters. The algorithms were validated on 10 different brands/topics using news data collected over one year. Quantitative metrics such as entropy and silhouette score, along with visualization techniques, were used to validate the algorithms. The BoClu-based Word2Vec model consistently outperformed the other methods, but it is very sensitive to the number of clusters. Pairing BoClu with a more robust method for choosing the number of clusters should further improve cluster quality.
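The two document representations named above (aggregated word vectors and Bag of Clusters) can be illustrated with a minimal sketch. Everything here is hypothetical: the toy two-dimensional vectors stand in for trained Word2Vec embeddings, and the word-to-cluster assignment is assumed to have been computed beforehand (for example by clustering the word vectors). It is not the project's actual code.

```python
from collections import Counter

DIM = 2
NUM_CLUSTERS = 2

# Toy "word vectors" (hypothetical values standing in for Word2Vec output).
WORD_VECTORS = {
    "phone": [1.0, 0.0],
    "tablet": [0.8, 0.2],
    "lawsuit": [0.0, 1.0],
    "court": [0.2, 0.8],
}

# Hypothetical assignment of each vocabulary word to a word cluster.
WORD_CLUSTER = {"phone": 0, "tablet": 0, "lawsuit": 1, "court": 1}


def aggregate_vector(tokens):
    """Represent a document as the mean of its known word vectors."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def bag_of_clusters(tokens):
    """Represent a document as counts of the word clusters its words fall in."""
    counts = Counter(WORD_CLUSTER[t] for t in tokens if t in WORD_CLUSTER)
    return [counts.get(c, 0) for c in range(NUM_CLUSTERS)]


doc = ["phone", "tablet", "court"]
mean_vec = aggregate_vector(doc)   # dense vector, length DIM
boclu_vec = bag_of_clusters(doc)   # count vector, length NUM_CLUSTERS
```

Either vector could then be fed to a document-level clustering step such as k-means. Note that the length of the BoClu vector equals the number of word clusters, which is one way to see why that representation would be sensitive to the cluster count.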
Sandeep Vanga ('15) and Rachan Bassi ('15) worked with Williams-Sonoma. The Williams-Sonoma product feed contains more than a million images, and the corresponding metadata, such as color, pattern, and type of image (catalog/multiproduct/singleproduct), is extremely important for optimizing search and product recommendations. They automated the image tagging process by employing image processing and machine learning tools. They used computer vision techniques based on image saliency and color histograms to segment and identify the important regions and features of an image. A decision-tree-based machine learning algorithm was then used to classify the images. They achieved 90% accuracy on silhouette/singleproduct images and 70% accuracy on complex multiproduct/catalog images.
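As a rough illustration of the color histogram idea mentioned above, the sketch below buckets RGB pixels into a coarse grid and reports the most common bucket. The function names, the bin count, and the sample pixels are assumptions for illustration; the project's actual feature extraction is not described in detail.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per coarse RGB bin.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    bins_per_channel is assumed to divide 256 evenly.
    """
    step = 256 // bins_per_channel
    hist = {}
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    return hist


def dominant_bin(pixels, bins_per_channel=4):
    """Return the RGB bin that contains the most pixels."""
    hist = color_histogram(pixels, bins_per_channel)
    return max(hist, key=hist.get)


# Two reddish pixels and one bluish pixel (made-up sample data).
sample_pixels = [(250, 10, 10), (240, 20, 5), (10, 10, 250)]
top_bin = dominant_bin(sample_pixels)
```

Bin counts like these could serve as input features for a classifier such as a decision tree.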
Michaela Hull ('15) worked with LA County to find duplicate voters using exact and fuzzy matching, to engineer features such as distances between two points of interest, to trawl the Census Bureau website for potentially useful demographic features, and to build classification models, all in the name of poll worker prediction. The work involved distributed computing, the Google Maps API, relational databases, large databases (~5 million observations), and a variety of machine learning techniques.
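A minimal sketch of combining exact and fuzzy matching to flag likely duplicate records, using Python's standard `difflib`. The record layout (name, date of birth), the similarity threshold, and the function names are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase the name and collapse whitespace; keep date of birth as-is."""
    name, dob = record
    return (" ".join(name.lower().split()), dob)


def is_probable_duplicate(a, b, threshold=0.85):
    """Exact match on date of birth, then exact or fuzzy match on name."""
    name_a, dob_a = normalize(a)
    name_b, dob_b = normalize(b)
    if dob_a != dob_b:
        return False
    if name_a == name_b:
        return True
    # ratio() returns a similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


# Made-up example records.
rec1 = ("Jon  Smith", "1980-01-01")
rec2 = ("John Smith", "1980-01-01")
rec3 = ("Jane Doe", "1980-01-01")
```

Comparing every pair of roughly 5 million records directly would be expensive, so in practice a blocking key (for example, date of birth) would typically be used to limit which pairs are compared.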
WeiWei Zhang ('15) worked with Zephyr Health. The core of the practicum at Zephyr Health was a machine learning project focused on determining the disease area relevancy of medical journals. The project began with data sampling from the PubMed database. Through natural language processing and feature engineering, the abstract and title text of medical documents was transformed into tokens with TF-IDF (Term Frequency, Inverse Document Frequency) scores. By leveraging the characteristics of a random forest classifier, the most important features in the feature space were selected. The body of the model was a multi-label logistic regression. The model results were evaluated on accuracy, recall, precision, and F1. In short, the project at Zephyr Health became a great example of handling unlabeled data, imbalanced classes, and multi-label problems in machine learning.
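To make the TF-IDF step concrete, here is a minimal sketch using one common variant: term frequency as count divided by document length, and inverse document frequency as log(N / document frequency). The project may well have used a different weighting or a library implementation; the sample documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: score} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return scores


# Made-up tokenized abstracts.
docs = [
    ["diabetes", "insulin", "trial"],
    ["asthma", "inhaler", "trial"],
]
scores = tfidf(docs)
```

A token that appears in every document ("trial" here) gets a score of zero, while tokens specific to one document get positive weight.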
Florian Burgos ('15) and Dan Loman ('15) worked with Flyr to use machine learning to predict the price of connecting flights based on the price of the one-ways. Improved user engagement on the website by displaying content on the landing page with D3. Content was computed overnight using distributed computing on an AWS EC2 instance to find the best deals in the US by origin.
Matt Shadish ('15) worked with Convergence Investment Management to apply machine learning techniques to improve on an existing trading strategy. In particular, he used Python and pandas to incorporate external variables and apply cross-sectional models to an inherently time-series problem. Created visualizations of current trading strategy performance using ggplot2 in R.
Matt Shadish ('15) worked with Engage3 to perform analysis of historical retail product prices across stores using Python. Created visualizations of these analyses in Matplotlib. Applied the analysis as a functional solution (using RDDs and DataFrames) so as to take advantage of Apache Spark. This made it possible to analyze billions of price history records in a reasonable amount of time.
Fletcher Stump Smith ('15) worked with LiveCareer.com to perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (SciPy). Wrote code for storing and processing text data, using Python and SQLite. Performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.
Brian Kui ('15) and Tunc Yilmaz ('15) worked with AutoGrid to implement generalized linear models and neural network models to improve AutoGrid's existing load forecasting models. Evaluated modifications to the forecasting models proposed by the data science team to help them decide whether it was feasible to incorporate the modifications into the production code. Through its demand response program, AutoGrid helps large industrial customers shed power by controlling power-consuming devices such as water heaters. They analyzed the signals received, the load, and the state of the water heaters, and pointed out mistakes in the operation.
Brian Kui ('15) and Tunc Yilmaz ('15) worked with Danaher Labs to query time series printer data that is extremely unbalanced, with fewer than 200 fault examples among more than 2 million non-fault time records. They applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.
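A short worked example shows why class imbalance of this size makes plain accuracy uninformative. The counts below are round illustrative figures based on the bounds quoted above (about 200 faults, about 2 million non-faults), not the actual dataset sizes, and the "always predict no fault" classifier is a hypothetical baseline.

```python
# Illustrative counts only (assumed round numbers, not the real data).
faults = 200
non_faults = 2_000_000

# A trivial baseline that never predicts a fault.
true_positives = 0
false_negatives = faults
true_negatives = non_faults
false_positives = 0

accuracy = (true_positives + true_negatives) / (faults + non_faults)
recall = true_positives / (true_positives + false_negatives)
```

The baseline scores above 99.99% accuracy while catching none of the faults, which is why metrics such as recall and precision on the fault class are usually more meaningful for rare-failure prediction.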
Griffin Okamoto ('15) and Scott Kellert ('15) worked with Summit Public Schools to demonstrate the efficacy of Summit's online content assessments by using student scores on these assessments and demographic information to predict their standardized test scores. Developed a linear regression model using R and presented results and recommendations for Summit's teaching model to the Information Team.
Kailey Hoo ('15), Griffin Okamoto ('15) and Ken Simonds ('15) worked with Clorox to mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R. Quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.
Rashmi Laddha ('15) worked with Stella & Dot to build a predictive model for revenue forecasting based on stylists' cohort behavior. She grouped stylists into micro-segments by analyzing their behavior in their initial days with the company, using k-means clustering on three parameters. Built a forecast model for each micro-segment in R using HoltWinters filtering and ARIMA. Optimized the model to get the error rate within 5%. Performed sensitivity analyses around changing early performance drivers in a stylist's life cycle.
Chandrashekar Konda ('15) worked with GE Software to solve three complex tasks in a sourcing big data project for GE. Parts Normalization - Identified similar mechanical parts out of 5M parts in oil rig design versions for the GE Oil & Gas business using Hadoop and Elasticsearch. Payment Terms Mapping - Identified the best payment terms from 1M payment terms across GE's different businesses, using Python and Talend. Sourcing - Compared over 1.8 million purchase transactions with 50k of GE's products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries, using Python.
Alice Benziger ('15) worked with Dictionary.com to create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data like page views (on mobile and desktop applications), email CTRs and social media (Facebook, Instagram & Twitter) interactions. Applied machine learning techniques to implement a model to predict the popularity score of new words to optimize user engagement.
Steven Chu ('15) worked with Fandor to define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics. The main goal was to understand the context in which business decisions are made at Fandor. In general, as Fandor is a subscription-based business, its focus is twofold: 1) bringing in more subscribers, and 2) keeping current subscribers around. There is a lot of potential to use these metrics both to segment users and to run predictions for them. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to see and utilize in their own decision-making processes.
Brendan Herger ('15) worked with Lawfty to study the existing data stream to drive business decisions, and optimized the extract-transform-load process to enable future insightful real-time data analysis. Though Lawfty's existing pipeline had substantial outage periods and largely unvalidated data, he was able to support the creation of a new Spanish-language vertical, create near-real-time facilities, and contribute to better targeting of AdWords campaigns.
Brendan Herger ('15) worked with RevUp to build out multiple data pipelines and utilize Machine Learning to help drive RevUp's beta product. He was able to create 3 new data streams which were directly put into production. Furthermore, he utilized Natural Language Processing and Machine Learning to validate and parse Mechanical Turk output. Finally, he utilized Spectral Clustering to identify individuals' political affiliations from Federal Election Commission data.
Layla Martin ('15) and Patrick Howell ('15) worked with MyFitnessPal.com to develop a machine learning model to predict a flavor label for every food in MyFitnessPal's database, primarily using Python and SQL. Built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within Under Armour.
Layla Martin ('15) and Leighton Dong ('15) worked with USF's Center for Institutional Planning and Effectiveness to analyze influential factors in USF undergraduate student retention using logistic regression models. Predicted students' decision to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.
Leighton Dong ('15) worked with Quiota to build consumer credit default risk models to support clients in managing investment portfolios. Prototyped a methodology to measure default risk using survival analysis and the Cox proportional hazards model. Developed an automated process to comprehensively collect company information using the CrunchBase API and store it in a NoSQL database. Engineered datasets to discover potential clients for analytics products such as retail pricing optimization. Collected company names and other text features from Bing search result pages automatically.
David Reilly ('15) worked with Uber to examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R. Constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. Analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.
Steven Rea ('15) worked with ChannelMeter to create an alert that notifies accounts when positive activity happens on their YouTube channels. Researched by extracting all video data for a sample of clients' YouTube channels. Implemented in Python and Postgres. Predicted the number of views a video would get before it was created, based on previous YouTube data, and provided insight into the behavior of YouTube videos. Implemented with machine learning techniques in Python and R.
Steven Rea ('15) and Cody Wild ('15) worked with Bayes Impact to analyze software logs to determine how customers were using their product, segmenting customers into different groups for analysis. Also offered advice on how to achieve statistically significant results for future analyses. Implemented in Python and Postgres.
Cody Wild ('15) worked with ChannelMeter to provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for their product's subscribers. Utilized clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that uses patterns of word recurrence to identify nearest neighbors in content space.
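One simple way to picture "nearest neighbors in content space" is cosine similarity over word counts. This is only an assumed stand-in: the project's actual similarity metric combined clustering and topic modeling and is not specified here. Channel names and tokens below are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_channel(target, channels):
    """Return the other channel whose word counts are most similar to target's."""
    target_counts = Counter(channels[target])
    sims = {
        name: cosine(target_counts, Counter(tokens))
        for name, tokens in channels.items()
        if name != target
    }
    return max(sims, key=sims.get)


# Hypothetical channels described by words from their video titles.
channels = {
    "cooking_a": ["recipe", "pasta", "kitchen"],
    "cooking_b": ["recipe", "kitchen", "baking"],
    "gaming_a": ["speedrun", "console", "review"],
}
closest = nearest_channel("cooking_a", channels)
```

With 300,000 channels, an exhaustive pairwise comparison like this would be slow, so an index or a clustering step to narrow candidates would likely be needed.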
Luba Gloukhova ('15) worked with Xambala to quantify the performance of an underlying high frequency trading strategy executed by Xambala. Expanded existing internal database with data sources from Bloomberg Terminal enabling deeper understanding of symbol characteristics underlying strategy performance. Identified discrepancies in end of day trading analysis database.
Daniel Kuo ('15) worked with Zephyr Health on a Publication Authorship Linkage project to develop a supervised machine learning algorithm that determines whether two (or more) publications refer to the same authors. Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily add new attributes and features to the models. The modeling techniques used in this project include logistic regression, decision trees, and AdaBoost. We used the first two algorithms to perform feature selection and then used AdaBoost to boost the models' performance. We went through 3 modeling iterations, sampling from the entire population for the first two runs and from a specific group (publications from western authors with affiliations) for the last. The final evaluations were obtained by testing the three models on an unseen test set from the specific population. The third model is far better than the first two in terms of F0.5 scores (0.44, 0.67, 0.88). Therefore, we suggest that the company build separate models for different sub-populations in order to achieve better results.
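For reference, the F0.5 score mentioned above is the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the standard formula follows; the precision and recall values in the example are made up and are not the project's results.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier with high precision and moderate recall.
example_f05 = f_beta(0.9, 0.6)            # beta = 0.5 favors precision
example_f1 = f_beta(0.9, 0.6, beta=1.0)   # beta = 1 is the usual F1
```

For the same precision and recall, F0.5 comes out higher than F1 when precision exceeds recall, which suits a linkage task where wrongly merging two authors is costlier than missing a link.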
Monica Meyer ('15) and Jeff Baker ('15) worked with Zephyr Health on the Disease Area Relevancy project to develop a classification algorithm/model that would predict and score how related a given document was to a specified disease area. The model provides Zephyr with the ability to quickly score and collect documents, as they relate to a requested disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The binary logistic regression was the best performer. This approach can be scaled to include other document types.
Our Team: Kicho Yu, Spencer Boucher, Trevor Stephens
Goal: Forecast the energy use and peak power events of residential and commercial end users.
By applying machine learning techniques to predict future demand, the team provided power producers and facility managers with a better way to plan for high load days. The team also identified and visualized temporal/seasonal outliers in smart-meter data, and profiled various customer segments for a large energy utility client.
Our Team: Matt O'Brien, Prateek Singha
Goal: Analyze the relationship between macroeconomic and hiring trends in healthcare.
The team analyzed over 3 years of hiring data within the nursing profession, and harvested the corresponding public stock information for a large set of health companies. The results were used to determine whether hiring is a leading or lagging indicator of stock movement.
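One common way to probe a leading/lagging relationship is to correlate one series against shifted copies of the other. The sketch below is an assumed illustration of that idea, not the team's documented method, and the two series are synthetic: the "stock" series is constructed to echo "hiring" one step later.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(hiring, stock, lag):
    """Correlate hiring[t] with stock[t + lag]; lag > 0 means hiring leads."""
    if lag >= 0:
        h, s = hiring[:len(hiring) - lag], stock[lag:]
    else:
        h, s = hiring[-lag:], stock[:len(stock) + lag]
    return pearson(h, s)


# Synthetic data: stock repeats hiring with a one-period delay.
hiring = [1, 3, 2, 5, 4, 6, 5, 8]
stock = [0] + hiring[:-1]

corr_same_period = lagged_correlation(hiring, stock, 0)
corr_hiring_leads = lagged_correlation(hiring, stock, 1)
```

If the correlation peaks at a positive lag, hiring tends to move before the stock series; a peak at a negative lag suggests the reverse. Correlation alone does not establish causation.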
Our Team: Can Jin, Prateek Singhal
Goal: Work with support.com to study chat transcripts acquired for various companies over the years.
Our team applied Text Analysis skills learned in the 'Text Analytics' class over a database that was approximately 4 TB in size.
Turbo Financial Group
Our Team: Manoj Venkatesh, Dora Wang, Li Tan
Goal: Analyze customer responses to consumer loan campaigns.
By applying a range of machine learning techniques, the team successfully overcame challenges with noisy, unbalanced datasets and built classification models to improve the response rate of marketing campaigns.
Our Team: Deeksha Chugh, Conor O'Sullivan, Anuj Saxena
Goal: Study the effect of changes in weather variables on the sales of consumer products.
Using machine learning methods, the team developed a predictive model to link weather variables with weekly sales trends. This model paved the way for consulting business growth into major retail brands.
Our Team: Ashish Thakur, Charles Yip
Goal: Create a production grade statistical algorithm to identify botnet attacks.
The team was given millions of records involving user activity on the Xoom website. With the data, the team successfully developed and trained two time series models that distinguished between botnet and user behavior. Xoom is planning to productionize these algorithms.
Our Team: William Goldstein, Justin Battles
Goal: Determine the significance of crash signatures.
Using map-reduce to explore over 500GB of user crash reports, the team built an arrival rate model covering over 400k incoming crash signatures and extracted trend lines. Not only did they provide guidance on the possibilities for crash signature analysis using large volumes of data, but they also went one step further and developed a dashboard to alert quality assurance team members to potential problems.
Our Team: Pete Merkouris, Rachel Philips
Goal: Study the effect of revisions to company websites and the stock returns.
The team applied statistical regression models and graphical techniques to search for signals in the stock returns based on revisions of the webpage.
Our Team: Bingkun Li, Meenu Kamalakshan
Goal: Study customer data and identify how to effectively spend their advertising budget.
The team was able to identify profitable customer segments using clustering and recommended which segments to target.
Our Team: Barbara Evangelista, Funmi Fapohunda, Oscar Hendrick
Goal: Analyze the conversion, engagement and churn of mobile app users for the purpose of informing marketing strategy and app improvement.
The team applied a range of statistical techniques to study the behavior of segmented groups and their work led to the identification of trends.
Our Team: Tak Wong, Spence Aiello
Goal: Analyze customer conversions.
The team extracted over half a million records to study the conversion rates of users from basic to paid accounts using SQL & R, and developed regression models and applied chi-square tests to predict an increase in conversion. This result was used to target the sending of impressions to basic account users.
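To illustrate the kind of chi-square test mentioned above, the sketch below computes the Pearson chi-square statistic for a 2x2 table of converted versus not-converted users in two groups. The counts are invented for illustration, and no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    Rows are groups; columns are (converted, not converted).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: targeted group 30/100 converted, control 20/100.
stat = chi_square_2x2(30, 70, 20, 80)

# Critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_0_05 = 3.841
significant = stat > CRITICAL_0_05
```

With these made-up counts the statistic falls below the 0.05 critical value, so the difference in conversion rates would not be called significant at that level despite the 10-point gap; larger samples would be needed.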