Our Team: Sandeep Vanga
Goal: Perform unsupervised text clustering to gain insights into representative sub-topics.
Sandeep built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep-learning-based) feature models: the first aggregates word vectors, and the second uses a Bag of Clusters (BoClu) of words. He also implemented the elbow method to choose the optimal number of clusters. These algorithms were validated on 10 different brands/topics using news data collected over one year, with quantitative metrics such as entropy and silhouette score, as well as visualization techniques.
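A minimal sketch of the baseline approach, assuming scikit-learn and an invented toy corpus: TF-IDF features, k-means clustering, and the elbow method's inertia curve.

```python
# Sketch of the baseline: TF-IDF features + k-means + elbow method.
# The documents below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "new phone camera review", "smartphone battery life review",
    "stock market rally today", "markets close higher on earnings",
    "team wins championship game", "playoff game ends in overtime",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Elbow method: inertia (within-cluster sum of squares) for each candidate k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 5)}

# Fit the chosen k and inspect cluster assignments.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(sorted(inertias), len(set(labels)))
```

In practice one plots the inertia curve and picks the k at the "elbow"; silhouette score offers a complementary check.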
Our Team: Sandeep Vanga and Rachan Bassi
Goal: Automate the process of image tagging by employing image processing as well as machine learning tools.
Williams-Sonoma’s product feed contains more than a million images, and the corresponding metadata — such as color, pattern, and type of image (catalog/multiproduct/single-product) — is extremely important for optimizing search and product recommendations. The team used image saliency and color-histogram-based computer vision techniques to segment and identify important regions/features of an image, and a decision-tree-based machine learning algorithm to classify the images. They achieved 90% accuracy on silhouette/single-product images and 70% on complex multiproduct/catalog images.
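As an illustration of the feature-plus-classifier pattern described (not Williams-Sonoma's actual pipeline), a coarse per-channel color histogram fed to a decision tree might look like this; the "images" are random arrays standing in for real photos.

```python
# Hypothetical sketch: classify images as single-product (clean, bright
# background) vs. multiproduct (broad color spread) from color histograms.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def color_histogram(img, bins=8):
    """Concatenated per-channel histogram, normalized per channel."""
    feats = []
    for ch in range(3):
        hist, _ = np.histogram(img[..., ch], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)

# Toy data: "single-product" images are mostly bright background,
# "multiproduct" images span the full color range.
single = [rng.integers(180, 256, (32, 32, 3)) for _ in range(20)]
multi = [rng.integers(0, 256, (32, 32, 3)) for _ in range(20)]
X = np.array([color_histogram(im) for im in single + multi])
y = [0] * 20 + [1] * 20

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.score(X, y))
```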
Los Angeles County
Our Team: Michaela Hull
Goal: Find duplicate voters using exact and fuzzy matching; engineer features such as distances between points of interest; trawl the Census Bureau website for potentially useful demographic features; and build classification models, all in the name of poll worker prediction.
Michaela used distributed computing, the Google Maps API, and relational databases, handled large datasets (~5 million observations), and applied a variety of machine learning techniques.
Our Team: Florian Burgos and Dan Loman
Goal: Use machine learning to predict the price of connecting flights based on the price of the one-way tickets.
Florian and Dan improved user engagement on the website by displaying content on the landing page with D3.js. They also computed content overnight using distributed computing on an AWS EC2 instance to find the best deals in the U.S. by origin.
Our Team: Matt Shadish
As a component in the Parts Harmonization initiative, Matt matched up similar industrial part descriptions to find identical parts possibly sourced from different vendors using Levenshtein string edit distances, and turned this process into a MapReduce job using Hadoop Streaming. He improved matching of company names to map known contacts to companies GE has business relations with (using Levenshtein string edit distances) and took advantage of multiprocessing in Python to dramatically reduce runtime.
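The matching step can be sketched with a standard dynamic-programming Levenshtein distance fanned out over a worker pool; the part descriptions below are invented.

```python
# Sketch of parts matching: Levenshtein edit distance computed in
# parallel with multiprocessing. Part descriptions are made up.
from functools import partial
from multiprocessing import Pool

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions/deletions/substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    query = "hex bolt 10mm steel"
    catalog = ["hex bolt 10 mm steel", "flange nut 8mm", "hex bolt 12mm steel"]
    with Pool(2) as pool:
        dists = pool.map(partial(levenshtein, query), catalog)
    print(dists)  # smallest distance = closest candidate match
```

Each catalog comparison is independent, so the work parallelizes cleanly; the same pairwise function also maps directly onto a Hadoop Streaming job.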
Our Team: Chandrashekar Konda
Goal: Solve three complex tasks in a sourcing big data project for GE:
Parts normalization: Using Hadoop and Elasticsearch, we identified similar mechanical parts among five million parts in oil rig design versions for GE's Oil & Gas business.
Payment terms mapping: Using Python and Talend, we identified the best payment terms from one million payment terms across GE’s different businesses.
Sourcing: Using Python, we compared over 1.8 million purchase transactions with 50,000 of GE’s products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries.
Convergence Investment Management
Our Team: Matt Shadish
Goal: Apply machine learning techniques to improve on an existing trading strategy.
Matt used Python and pandas to incorporate external variables and apply cross-sectional models to a time series problem. He also created visualizations of the current trading strategy's performance using ggplot2 in R.
Our Team: Matt Shadish
Goal: Perform analysis of historical retail product prices across stores using Python.
Matt created visualizations of these analyses in Matplotlib. He then implemented the analysis as a functional solution (using RDDs and DataFrames) to take advantage of Apache Spark, enabling analysis of billions of price-history records in a reasonable amount of time.
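The functional style that Spark parallelizes can be sketched in plain Python: a map followed by a keyed reduce, here computing average price per SKU over toy records (the field names are hypothetical). On a Spark RDD the equivalent would be a map followed by reduceByKey.

```python
# Pure-Python sketch of the map/reduce pattern Spark distributes:
# average price per SKU. Records and fields are invented.
from functools import reduce

records = [
    {"store": "A", "sku": "123", "price": 9.99},
    {"store": "B", "sku": "123", "price": 10.49},
    {"store": "A", "sku": "456", "price": 4.25},
    {"store": "B", "sku": "456", "price": 3.99},
]

# map: emit (sku, (price, 1)) pairs.
pairs = map(lambda r: (r["sku"], (r["price"], 1)), records)

def merge(acc, kv):
    """reduce: accumulate (total, count) per sku."""
    sku, (p, n) = kv
    tot, cnt = acc.get(sku, (0.0, 0))
    acc[sku] = (tot + p, cnt + n)
    return acc

totals = reduce(merge, pairs, {})
avg_price = {sku: tot / cnt for sku, (tot, cnt) in totals.items()}
print(avg_price)
```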
Our Team: Fletcher Stump Smith
Goal: Perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (Scipy).
Fletcher wrote code to store and process text data, using Python and SQLite. He performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.
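A minimal sketch of this setup, assuming scikit-learn: bag-of-words counts stored as SciPy sparse matrices, fed to multinomial Naive Bayes. The job snippets and labels are invented.

```python
# Sketch of document classification with Naive Bayes over sparse vectors.
# Training texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "python sql machine learning pipelines",
    "statistical modeling and data analysis",
    "sales quota customer outreach calls",
    "territory management and client sales",
]
labels = ["data", "data", "sales", "sales"]

vec = CountVectorizer()
X = vec.fit_transform(texts)      # SciPy sparse CSR matrix
clf = MultinomialNB().fit(X, labels)

pred = clf.predict(vec.transform(["sql and data pipelines"]))
print(pred[0])
```

The sparse representation is what makes this scale: most documents use a tiny fraction of the vocabulary, so only nonzero counts are stored.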
Our Team: Brian Kui and Tunc Yilmaz
Goal: Implement generalized linear models and neural network models to improve existing load forecasting models.
AutoGrid helps industrial customers shed power by controlling the operation of power-consuming devices such as water heaters. The team evaluated modifications to the forecasting models proposed by the data science team, to help AutoGrid decide whether it is feasible to incorporate the modifications into production code. They analyzed received signals, load, and the state of the water heaters, and identified errors in operation.
Our Team: Brian Kui and Tunc Yilmaz
Goal: Query time-series printer data that is highly unbalanced: fewer than 200 faults within two million time records.
Our team applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.
Summit Public Schools
Our Team: Griffin Okamoto and Scott Kellert
Goal: Demonstrate the efficacy of Summit's online content assessments by using student scores on these assessments and demographic information to predict their standardized test scores.
Griffin and Scott developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.
Our Team: Kailey Hoo, Griffin Okamoto, and Ken Simonds
Goal: Mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R.
The team quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.
Stella & Dot
Our Team: Rashmi Laddha
Goal: Build a predictive model for revenue forecasting based on stylists' cohort behavior.
Rashmi identified stylist micro-segments by analyzing stylists' behavior in their initial days at the company, using k-means clustering on three parameters. She then built a forecast model for each micro-segment in R using Holt-Winters filtering and ARIMA, tuning the model to an error rate within 5%. She also performed sensitivity analyses around changing early performance drivers in a stylist's life cycle.
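As a self-contained illustration of the forecasting idea (a simplification: Holt's linear method rather than the full seasonal Holt-Winters model used in R), with made-up revenue figures:

```python
# Simplified exponential-smoothing forecast (level + trend), a non-seasonal
# cousin of Holt-Winters. Revenue figures and smoothing constants are invented.
def holt_forecast(series, alpha=0.5, beta=0.3, steps=1):
    """Forecast `steps` points ahead with Holt's linear method."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(steps)]

revenue = [100, 110, 121, 133, 146]   # a steadily growing micro-segment
print(holt_forecast(revenue, steps=2))
```

Fitting a separate small model per micro-segment, as described, keeps each series homogeneous enough for this kind of smoothing to work well.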
Our Team: Alice Benziger
Goal: Create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data, such as page views (on mobile and desktop applications), email click-through rates, and social media (Facebook, Instagram, and Twitter) interactions.
Alice built a machine learning model to predict the popularity score of new words in order to optimize user engagement.
Our Team: Steven Chu
Goal: Define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics.
As Fandor is a subscription-based business, its focus is to bring in new subscribers and retain current ones. There is a lot of potential to use metrics to segment users as well as to run predictions. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to use in their own decision-making.
Our Team: Brendan Herger
Goal: Study the existing data stream to drive business decisions, and optimize the extract-transform-load (ETL) process to enable insightful real-time data analysis in the future.
Though Lawfty's existing pipeline had substantial outage periods and largely unvalidated data, Brendan was able to support the creation of a new Spanish-language vertical, build near-real-time facilities, and contribute to better-targeted AdWords campaigns.
Our Team: Brendan Herger
Goal: Build out multiple data pipelines and utilize Machine Learning to help drive RevUp’s beta product.
Brendan created three new data streams that were put directly into production. He also used natural language processing and machine learning to validate and parse Mechanical Turk output, and spectral clustering to identify individuals' political affiliations from Federal Election Commission data.
Our Team: Layla Martin and Patrick Howell
Goal: Develop a machine learning model to predict a flavor label for every food in MyFitnessPal's database.
Using primarily Python and SQL, the team built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within Under Armour.
USF’s Center for Institutional Planning and Effectiveness
Our Team: Layla Martin and Leighton Dong
Goal: Analyze influential factors in USF undergraduate student retention using logistic regression models.
The team predicted students' decisions to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.
Our Team: Leighton Dong
Goal: Build consumer credit default risk models to support clients in managing investment portfolios.
Leighton prototyped a methodology to measure default risk using survival analysis and a Cox proportional hazards model. He developed an automated process to comprehensively collect company information using the CrunchBase API and store it in a NoSQL database. Leighton also engineered datasets to discover potential clients for analytics products (such as retail pricing optimization), automatically collecting company names and other text features from Bing search result pages.
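Survival analysis treats "time until default" as the quantity of interest. A Cox model requires covariates and a fitting library; as a self-contained building block, here is a toy Kaplan-Meier survival-curve estimator over invented loan durations (censored = still performing at the end of observation).

```python
# Toy Kaplan-Meier estimator: survival probability S(t) from
# (duration, event-observed) pairs. Loan durations are invented.
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each observed default time.
    observed[i] is True if default happened, False if censored.
    Toy version: assumes distinct event times."""
    # At ties, process events before censorings (standard KM convention).
    ordered = sorted(zip(durations, observed), key=lambda p: (p[0], not p[1]))
    at_risk, surv, curve = len(durations), 1.0, []
    for t, event in ordered:
        if event:
            surv *= (at_risk - 1) / at_risk
            curve.append((t, surv))
        at_risk -= 1
    return curve

# Months until default (True) or end of observation (False).
durations = [3, 5, 5, 8, 12, 12]
observed  = [True, True, False, True, False, False]
print(kaplan_meier(durations, observed))
```

A Cox proportional hazards model extends this idea by modeling how covariates (credit score, income, etc.) scale the baseline hazard.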
Our Team: David Reilly
Goal: Examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R.
David constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. He analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.
Our Team: Steven Rea
Goal: Create an alert for YouTube channels to notify accounts when positive activity occurred on their channel.
Steven extracted video data for a sample of clients' YouTube channels. He implemented a model in Python and Postgres to predict, before a video was created, the number of views it would get, using historical YouTube data. Steven was also able to give insight into the behavior of YouTube videos using machine learning techniques in Python and R.
Our Team: Cody Wild
Goal: Provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for their product's subscribers.
Cody utilized clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that utilizes patterns of word reoccurrence to identify nearest neighbors in content space.
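The core of such a similarity metric can be sketched as cosine similarity over per-channel word counts; the channel vocabularies below are invented, and real topic modeling would add a dimensionality-reduction step on top.

```python
# Sketch of a channel-similarity metric: cosine similarity between
# word-usage count vectors. Channel vocabularies are invented.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

chess = Counter("opening gambit endgame blunder gambit".split())
chess2 = Counter("gambit opening tactics endgame".split())
cooking = Counter("recipe sauce oven simmer".split())

print(cosine(chess, chess2), cosine(chess, cooking))
```

Nearest neighbors under this metric are channels that reuse the same vocabulary, which is a reasonable proxy for "content space" proximity.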
Our Team: Luba Gloukhova
Goal: Quantify the performance of an underlying high frequency trading strategy executed by Xambala.
Luba expanded existing internal database with data sources from Bloomberg Terminal, enabling deeper understanding of symbol characteristics underlying strategy performance. She also identified discrepancies in an end-of-day trading analysis database.
Our Team: Daniel Kuo
Goal: Develop a supervised machine learning algorithm for a publication authorship linkage project to determine whether multiple publications refer to the same authors.
Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily augment models with new attributes and features. The modeling techniques used in this project include logistic regression, decision trees, and AdaBoost. The team used the first two algorithms to perform feature selection and then used AdaBoost to improve performance.
Our Team: Monica Meyer and Jeff Baker
Goal: Develop a classification algorithm/model for the Disease Area Relevancy project that would predict and score how related a given document was to a specified disease area.
The model provides Zephyr the ability to quickly score and collect documents, as they relate to a disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The approach is scalable to include other document types.
Our Team: WeiWei Zhang
Goal: Determine disease area relevancy for medical journals using machine learning techniques.
The project began with data sampling from the PubMed database. Through natural language processing and feature engineering, the abstracts and titles of medical documents were transformed into tokens with TF-IDF (term frequency-inverse document frequency) scores. The most important features were then selected by leveraging a random forest classifier. The body of the model was a multi-label logistic regression, and the results were evaluated on accuracy, recall, precision, and F1 score. In short, the project is a great example of handling unlabeled data, imbalanced classes, and multi-label problems in a machine learning context.
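A compact sketch of the described pipeline, assuming scikit-learn and invented abstracts and labels: TF-IDF features, random-forest importances for feature selection, then one-vs-rest logistic regression for the multi-label step.

```python
# Sketch: TF-IDF -> random-forest feature selection -> multi-label
# logistic regression. Abstracts and disease labels are invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

abstracts = [
    "tumor growth and chemotherapy response",
    "insulin resistance and glucose metabolism",
    "metastatic tumor cells resist chemotherapy",
    "glucose monitoring in diabetic patients",
    "tumor biomarkers in diabetic patients",
]
labels = [{"oncology"}, {"diabetes"}, {"oncology"}, {"diabetes"},
          {"oncology", "diabetes"}]

X = TfidfVectorizer().fit_transform(abstracts).toarray()
Y = MultiLabelBinarizer().fit_transform(labels)   # 0/1 indicator matrix

# Feature selection: keep terms with above-median random-forest importance.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y)
keep = rf.feature_importances_ > np.median(rf.feature_importances_)
X_sel = X[:, keep]

# One binary logistic regression per label handles the multi-label output.
clf = OneVsRestClassifier(LogisticRegression()).fit(X_sel, Y)
print(clf.predict(X_sel).shape)
```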
Our Team: Kicho Yu, Spencer Boucher, and Trevor Stephens
Goal: Forecast energy use and peak power events for residential and commercial end users.
The team applied machine learning techniques to predict future demand so that power producers and facility managers could better plan for high load days. They also identified and visualized temporal/seasonal outliers in smart-meter data, and profiled various customer segmentations for a large energy utility client.
Our Team: Matt O'Brien and Prateek Singhal
Goal: Analyze the relationship between macroeconomic and hiring trends in healthcare.
The team analyzed over three years of hiring data within the nursing profession, and harvested the corresponding public stock information for a large set of health companies. Their results were used to determine if hiring is a leading or lagging indicator of stock movement.
Our Team: Can Jin and Prateek Singhal
Goal: Study the chat transcripts that had been acquired for various companies over the years.
The team applied text analysis skills — learned in USF’s Text Analytics class — over a database, which was approximately 4 TB in size.
Turbo Financial Group
Our Team: Manoj Venkatesh, Dora Wang, and Li Tan
Goal: Analyze customer responses to consumer loan campaigns.
By applying a range of machine learning techniques, the team successfully overcame challenges with noisy, unbalanced datasets and built classification models to improve the response rate of marketing campaigns.
Our Team: Deeksha Chugh, Conor O'Sullivan, and Anuj Saxena
Goal: Study the effect of changes in weather variables on the sales of consumer products.
Using machine learning methods, the team developed a predictive model to link weather variables with weekly sales trends. This model paved the way for consulting business growth into major retail brands.
Our Team: Ashish Thakur and Charles Yip
Goal: Create a production-grade statistical algorithm for identifying botnet attacks.
The team was given millions of records involving user activity on the Xoom website. With the data, they successfully developed and trained two time series models that distinguished between botnet and user behavior. Xoom is planning to productionize their algorithms.
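One simple way a time series model can separate botnet bursts from steady user traffic (a hypothetical illustration, not Xoom's actual models) is a rolling z-score on request counts.

```python
# Hypothetical botnet-vs-user heuristic: flag time buckets whose request
# count deviates sharply from a rolling baseline. Counts are invented.
from statistics import mean, stdev

def flag_bursts(counts, window=5, z_thresh=3.0):
    """Flag counts[i] whose z-score vs. the previous `window` points
    exceeds z_thresh."""
    flags = []
    for i in range(window, len(counts)):
        base = counts[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma == 0:
            flags.append(counts[i] != mu)   # any change from a flat baseline
        else:
            flags.append((counts[i] - mu) / sigma > z_thresh)
    return flags

# Steady human traffic per minute, then a botnet-like spike at the end.
traffic = [20, 22, 19, 21, 20, 23, 21, 20, 22, 240]
print(flag_bursts(traffic))
```

Real traffic needs seasonality handling (time-of-day, day-of-week), which is where proper time series models earn their keep over a plain rolling window.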
Our Team: William Goldstein and Justin Battles
Goal: Determine the significance of crash signatures.
Using MapReduce to explore over 500 GB of user crash reports, our team built an arrival-rate model covering over 400,000 incoming crash signatures and extracted trend lines. Not only did they provide guidance on the possibilities for crash-signature analysis using large volumes of data, they also went one step further and developed a dashboard to alert quality assurance team members to potential problems.
Our Team: Pete Merkouris and Rachel Philips
Goal: Study the effect of revisions to company websites on stock returns.
Pete and Rachel applied statistical regression models and graphical techniques to search for signals in stock returns based on website revisions.
Our Team: Bingkun Li and Meenu Kamalakshan
Goal: Study customer data and identify how to spend the advertising budget effectively.
Our team was able to identify profitable customer segments using clustering and make a recommendation on which segments to target.
Our Team: Barbara Evangelista, Funmi Fapohunda, and Oscar Hendrick
Goal: Analyze the conversion, engagement and churn of mobile app users for the purpose of informing marketing strategy and app improvement.
The team applied a range of statistical techniques to study the behavior of segmented groups and their work led to the identification of trends.
Our Team: Tak Wong and Spence Aiello
Goal: Analyze customer conversions.
Tak and Spence extracted over half a million records to study the conversion rates of users from basic to paid accounts using SQL and R, then developed regression models and applied chi-square tests to predict increases in conversion. The results were used to target the sending of impressions to basic-account users.
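The chi-square step can be illustrated with a 2x2 conversion table (the counts below are invented): did users shown impressions convert at a different rate than those who were not?

```python
# Pearson chi-square test of independence on an invented 2x2 table:
# rows = shown impressions vs. not, columns = converted vs. not.
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

stat = chi_square_2x2(120, 880, 80, 920)
# Compare against 3.841, the chi-square cutoff at 1 df and p = 0.05.
print(round(stat, 3))
```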