The Master of Science in Analytics (MSAN) program at the University of San Francisco is an innovative and challenging training program for developing future analytics professionals. The MSAN Program invites you to find out more about how your organization might benefit from working with a group of highly motivated and committed students from this exciting graduate program.
The program seeks clients interested in working with one or more teams of students on an Analytics Practicum project. These projects are centered on creating business value from data and extend over several months.
As a central part of our 12-month curriculum, USF's MSAN program provides training in the leadership, teamwork, communication, professionalism and project execution skills that are vital for analysts and data scientists to be effective in the business world. We teach our students these skills through lectures, readings, workshops, written assignments, and simulations.
But there is simply no substitute for experience. Throughout the program, our student teams work on a real project with real (and usually messy) data from a client organization. Our faculty also manages and mentors the students to ensure client success.
In short, these "Practicum Projects" give our students invaluable first-hand experience as analytics professionals, while providing clients with tangible benefits from their hard work, knowledge and discoveries.
The MSAN program demands a great deal from its students, including a strong commitment to client success. In addition, we work very hard to make sure clients are committed to the success of their Practicum project. While there are no hard and fast rules, from our experience we believe the most successful projects:
Learn more about previous Practicum projects
The MSAN program is now in its fourth year and we are pleased to report many successful and ongoing relationships with Dictionary.com, Williams-Sonoma, Xoom, Zephyr Health, AutoGrid, Uber, Clorox and more.
Sandeep Vanga ('15) worked with Google to perform unsupervised text clustering on news articles about brands and topics of interest to Google, in order to gain insight into the most representative subtopics. He built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep-learning-based feature) models: the first is based on aggregating word vectors, and the second is based on a Bag of Clusters (BoClu) of words. He implemented the elbow method to choose the optimal number of clusters. The algorithms were validated on 10 different brands/topics using news data collected over one year. Quantitative metrics such as entropy and silhouette score, along with visualization techniques, were used to validate the algorithms. The BoClu-based Word2Vec model consistently outperformed the other methods, but it is very sensitive to the number of clusters. Using a more robust method for choosing the number of clusters along with BoClu should further improve cluster quality.
Sandeep Vanga ('15) and Rachan Bassi ('15) worked with Williams-Sonoma. The Williams-Sonoma product feed contains more than a million images, and the corresponding metadata, such as color, pattern, and type of image (catalog/multiproduct/singleproduct), is extremely important for optimizing search and product recommendations. They automated the process of image tagging by employing image processing as well as machine learning tools. They used computer vision techniques based on image saliency and color histograms to segment and identify important regions and features of an image. A decision-tree-based machine learning algorithm was then used to classify the images. They achieved 90% accuracy on silhouette/singleproduct images and 70% accuracy on complex multiproduct/catalog images.
Michaela Hull ('15) worked with LA County to find duplicate voters using exact and fuzzy matching, feature engineering such as distances between two points of interest, combing the Census Bureau website for potentially useful demographic features, and classification models, all in the name of poll worker prediction. The work involved distributed computing, the Google Maps API, relational databases, large datasets (~5 million observations), and a wide range of machine learning techniques.
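The elbow method mentioned above can be illustrated with a small, self-contained sketch. It is not the code used in the project: the toy 2-D points stand in for TF-IDF or Word2Vec feature vectors, the farthest-point initialization is chosen only to keep the run deterministic, and the "largest second difference" rule in the hypothetical `elbow_k` helper is just one simple way to locate the bend in the inertia curve.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (centers, inertia)."""
    centers = init_centers(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(dist2(p, c) for c in centers) for p in points)
    return centers, inertia


def elbow_k(inertias):
    """Pick the k where the drop in inertia slows down the most."""
    ks = sorted(inertias)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (inertias[prev_k] - inertias[k]) - (inertias[k] - inertias[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Toy data: three tight groups of four points each (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
best_k = elbow_k(inertias)
print(inertias, best_k)
```

On clean, well-separated data like this the bend is unambiguous. On real text features the inertia curve is usually much smoother, which is consistent with the sensitivity to the number of clusters noted in the project summary.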
WeiWei Zhang ('15) worked with Zephyr Health. The core of the practicum at Zephyr Health was a machine learning project focused on determining the disease area relevancy of medical journals. The project began with data sampling from the PubMed database. Through natural language processing and feature engineering, the abstract and title text of medical documents was transformed into tokens with TF-IDF (Term Frequency, Inverse Document Frequency) scores. By leveraging the characteristics of a random forest classifier, the most important features from the feature space were selected. The body of the model was a multi-label logistic regression. The model results were evaluated on accuracy, recall, precision, and F1 score. In short, the project at Zephyr Health has become a great example of handling unlabeled data, imbalanced classes, and multi-label problems in machine learning.
Florian Burgos ('15) and Dan Loman ('15) worked with Flyr to use machine learning to predict the price of connecting flights based on the prices of the one-way flights. They improved user engagement on the website by displaying content on the landing page with D3. The content was computed overnight using distributed computing on an AWS EC2 instance to find the best deals in the US by origin.
Matt Shadish ('15) worked with Convergence Investment Management to apply machine learning techniques to improve on an existing trading strategy. In particular, he used Python and pandas to incorporate external variables and build cross-sectional models for an inherently time-series problem. He also created visualizations of current trading strategy performance using ggplot2 in R.
Matt Shadish ('15) worked with Engage3 to perform analysis of historical retail product prices across stores using Python. Created visualizations of these analyses in Matplotlib. Applied the analysis as a functional solution (using RDDs and DataFrames) so as to take advantage of Apache Spark. This made it possible to analyze billions of price history records in a reasonable amount of time.
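As a rough illustration of the TF-IDF scoring described above, the following sketch computes scores with one common textbook variant (term count divided by document length, times the log of the number of documents over the document frequency). The three "abstracts" are made up for the example, the `tf_idf` helper is hypothetical, and production libraries typically apply different smoothing and normalization, so this should not be read as the project's actual pipeline.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: score} dict per document.

    tf  = token count / number of tokens in the document
    idf = log(number of documents / number of documents containing the token)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each token appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return scores


# Made-up miniature "abstracts" (illustrative only).
docs = [
    "diabetes insulin study",
    "cancer tumor study",
    "diabetes glucose study",
]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

A token that appears in every document ("study" here) scores zero, while a token unique to one document scores highest, which is why such scores are useful inputs for a downstream relevance classifier.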
Fletcher Stump Smith ('15) worked with LiveCareer.com to perform natural language processing (NLP) and document classification using Naive Bayes with scikit-learn and sparse vector representations (SciPy). Wrote code for storing and processing text data, using Python and SQLite. Performed continuous testing and refactoring of existing data science code. All of this went towards building a framework for finding words relevant to specific jobs.
Brian Kui ('15) and Tunc Yilmaz ('15) worked with AutoGrid to implement generalized linear models and neural network models to improve AutoGrid's existing load forecasting models. They evaluated modifications to the forecasting models proposed by the data science team to help the team decide whether it was feasible to incorporate them into the production code. Through its demand response program, AutoGrid helps large industrial customers shed power by controlling power-consuming devices such as water heaters. The students analyzed the signals received, the load, and the state of the water heaters, and pointed out mistakes in the operation.
Brian Kui ('15) and Tunc Yilmaz ('15) worked with Danaher Labs to query time series printer data that is extremely unbalanced, with fewer than 200 fault examples among more than 2 million non-fault time records. They applied machine learning algorithms to predict rare failures of industrial printers in order to find a model to implement in production for real-time predictions.
Griffin Okamoto ('15) and Scott Kellert ('15) worked with Summit Public Schools to demonstrate the efficacy of Summit's online content assessments by using student scores on these assessments and demographic information to predict their standardized test scores. Developed a linear regression model using R and presented results and recommendations for Summit's teaching model to the Information Team.
Kailey Hoo ('15), Griffin Okamoto ('15) and Ken Simonds ('15) worked with Clorox to mine actionable insights from over 20,000 online product reviews using text analytics techniques in Python and R. Quantified consumer opinions about a variety of product attributes for multiple brands to assess brand strengths and weaknesses.
Rashmi Laddha ('15) worked with Stella & Dot to build a predictive model for revenue forecasting based on stylists' cohort behavior. Clustered stylists into micro-segments by analyzing their behavior in the initial days after joining the company. Used k-means clustering on three parameters to cluster stylists. Built a forecast model for each micro-segment in R using Holt-Winters filtering and ARIMA. Optimized the model to get the error rate within 5%. Performed sensitivity analyses around changing early performance drivers in a stylist's life cycle.
Chandrashekar Konda ('15) worked with GE Software to solve three complex tasks in a sourcing big data project for GE. Parts Normalization - Identified similar mechanical parts out of 5M parts in oil rig design versions for the GE Oil & Gas business using Hadoop and Elasticsearch. Payment Terms Mapping - Identified the best payment terms from 1M payment terms across GE's different businesses, using Python and Talend. Sourcing - Compared over 1.8 million purchase transactions with 50k GE products to ascertain whether GE can benefit if all materials are procured from other GE subsidiaries, using Python.
Alice Benziger ('15) worked with Dictionary.com to create a popularity index for Dictionary.com’s Word of the Day feature based on user engagement data like page views (on mobile and desktop applications), email CTRs and social media (Facebook, Instagram & Twitter) interactions. Applied machine learning techniques to implement a model to predict the popularity score of new words to optimize user engagement.
Steven Chu ('15) worked with Fandor to define, calculate, and analyze product features, user lifetime value, user behavior, and film success metrics. The main goal was to understand the context in which business decisions are made at Fandor. In general, as Fandor has a subscription-based model, its focus is twofold: 1) bringing in more subscribers, and 2) keeping current subscribers around. There is a lot of potential to use these metrics both to segment users and to run predictions for them. Currently, one of these metrics (film score) is in production as a time-series visualization for stakeholders to see and use in their own decision-making processes.
Brendan Herger ('15) worked with Lawfty to study its existing data stream to drive business decisions, and optimized the data extract-transform-load process to enable future insightful real-time data analysis. Though Lawfty's existing pipeline had substantial outage periods and largely unvalidated data, he was able to support creating a new Spanish-language vertical, creating near-real-time facilities, and contributing to better targeting of AdWords campaigns.
Brendan Herger ('15) worked with RevUp to build out multiple data pipelines and use machine learning to help drive RevUp's beta product. He created 3 new data streams which were put directly into production. Furthermore, he used natural language processing and machine learning to validate and parse Mechanical Turk output. Finally, he used spectral clustering to identify individuals' political affiliations from Federal Election Commission data.
Layla Martin ('15) and Patrick Howell ('15) worked with MyFitnessPal.com to develop a machine learning model to predict a flavor label for every food in MyFitnessPal's database, primarily using Python and SQL. Built a data pipeline to better deliver subscription numbers and revenue to business intelligence units within Under Armour.
Layla Martin ('15) and Leighton Dong ('15) worked with USF's Center for Institutional Planning and Effectiveness to analyze influential factors in USF undergraduate student retention using logistic regression models. Predicted students' decision to withdraw, continue, or graduate from USF by leveraging machine learning techniques in R. These insights have been used to improve institutional budget planning.
Leighton Dong ('15) worked with Quiota to build consumer credit default risk models to support clients in managing investment portfolios. Prototyped a methodology to measure default risk using survival analysis and the Cox proportional hazards model. Developed an automated process to comprehensively collect company information using the CrunchBase API and store it in a NoSQL database. Engineered datasets to discover potential clients for analytics products such as retail pricing optimization. Collected company names and other text features from Bing search result pages automatically.
David Reilly ('15) worked with Uber to examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R. Constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. Analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.
Steven Rea ('15) worked with ChannelMeter to create an alert that notifies accounts when positive activity happens on their YouTube channels. Researched by extracting all video data for a sample of clients' YouTube channels. Implemented in Python and Postgres. Predicted the number of views a video would get before it was created, based on previous YouTube data, which also gave insight into the behavior of YouTube videos. Implemented with machine learning techniques in Python and R.
Steven Rea ('15) and Cody Wild ('15) worked with Bayes Impact to analyze software logs to determine how customers were using their product, segmenting customers into different groups for analysis. Also offered advice on how to achieve statistically significant results for future analyses. Implemented in Python and Postgres.
Cody Wild ('15) worked with ChannelMeter to provide a means for ChannelMeter to leverage its 300,000-channel database to identify close niche competitors for their product's subscribers. Used clustering and topic modeling, with a Mongo and Postgres backend, to construct a channel similarity metric that uses patterns of word reoccurrence to identify nearest neighbors in content space.
Luba Gloukhova ('15) worked with Xambala to quantify the performance of an underlying high frequency trading strategy executed by Xambala. Expanded existing internal database with data sources from Bloomberg Terminal enabling deeper understanding of symbol characteristics underlying strategy performance. Identified discrepancies in end of day trading analysis database.
Daniel Kuo ('15) worked with Zephyr Health on a Publication Authorship Linkage project to develop a supervised machine learning algorithm that determines whether two (or more) publications refer to the same authors. Via Zephyr's DMP system, the algorithm leverages the existing institution-to-institution record linkage to easily augment the models with new attributes and features. The modeling techniques used in this project include logistic regression, decision trees, and AdaBoost. We used the first two algorithms to perform feature selection and then used AdaBoost to boost the models' performance. We went through 3 modeling iterations, using a sample from the entire population for the first two runs and a sample from a specific group (publications from western authors with affiliations) for the last. The final evaluations were obtained by testing the three models on an unseen test set from the specific population. The third model is far better than the first two in terms of F0.5 score (0.44, 0.67, and 0.88, respectively). We therefore suggest that the company build separate models for different sub-populations in order to achieve better results.
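For readers unfamiliar with the F0.5 score used above, here is a minimal sketch of how an F-beta score is computed from binary predictions. With beta = 0.5, precision is weighted more heavily than recall. The labels below are invented for illustration and have no connection to the project's data, and the function names are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision, beta > 1 favors recall, beta = 1 gives F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented labels: 1 means "same author", 0 means "different authors".
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
print(precision, recall, f_half, f_one)
```

Favoring precision is a natural choice for record linkage, where wrongly merging two different authors is usually more costly than missing a true match.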
Monica Meyer ('15) and Jeff Baker ('15) worked with Zephyr Health on the Disease Area Relevancy project to develop a classification algorithm/model that would predict and score how related a given document was to a specified disease area. The model provides Zephyr with the ability to quickly score and collect documents, as they relate to a requested disease, to provide resulting documents to clients. Our team explored four different algorithms to address this problem: logistic regression, bagged logistic, naïve Bayes, and random forest. Both binary and multi-label approaches were tested. The binary logistic regression was the best performer. This approach can be scaled to include other document types.
Kicho Yu ('14), Spencer Boucher ('14) and Trevor Stephens ('14) worked with AutoGrid to forecast energy use and peak power events for residential and commercial end users. They applied machine learning techniques to predict future demand so that power producers and facility managers could better plan for high load days. They also identified and visualized temporal/seasonal outliers in smart-meter data, and profiled various customer segmentations for a large energy utility client.
Matt O'Brien ('14) and Prateek Singhal ('14) worked with SimplyHired to analyse the relationship between macroeconomic and hiring trends in healthcare. They analysed over 3 years of hiring data within the nursing profession, and harvested the corresponding public stock information for a large set of health companies. Their results were used to determine if hiring is a leading or lagging indicator of stock movement.
Can Jin ('14) and Prateek Singhal ('14) worked with Support.com to study the chat transcripts that had been acquired for various companies over the years. They applied Text Analysis skills learned in the 'Text Analytics' class over a database which was approximately 4 TB in size.
Manoj Venkatesh ('14), Dora Wang ('14) and Li Tan ('14) worked with Turbo Financial Group to analyze customer responses to consumer loan campaigns. By applying a range of machine learning techniques, they successfully overcame challenges with noisy, unbalanced data-sets and built classification models to improve the response rate of marketing campaigns.
Deeksha Chugh ('14), Conor O'Sullivan ('14), and Anuj Saxena ('14) worked with Weather.com to study the effect of changes in weather variables on the sales of consumer products. Using machine learning methods, the team developed a predictive model to link weather variables with weekly sales trends. This model paved the way for consulting business growth into major retail brands.
Ashish Thakur ('14) and Charles Yip ('14) worked with Xoom Corporation to create a production grade statistical algorithm for identifying botnet attacks. They were given millions of records involving user activity on the Xoom website. With the data, they successfully developed and trained two timeseries models that distinguished between botnet and user behavior. Xoom is planning to productionize their algorithms.
Students work 10 to 15 hours a week for a company for 10 of the program's 12 months. Students can work as paid interns on your payroll or via a mutual NDA and data analysis agreement between your company and USF. A faculty member on our side typically acts as project lead, with the students doing the work. We need someone at the client company who acts as a champion of the project, providing answers to questions, data, and so on. We usually work at a granularity of one semester, but students have often continued with the same company through multiple semesters.
The first step is to discuss possible projects from your organization and how they might fit with the goals of the MSAN program. Professors Intrevado and Interian can provide additional information and answer any questions: