The Data Institute seeks members interested in working with one or more teams of students on an MS in Data Science (MSDS) internship project. Projects are centered on creating business value through data science and last approximately nine months. Our students and faculty make a strong commitment to client success. A USF faculty member acts as project mentor, guiding students through the project. A project champion from the practicum company provides answers, business context and data.
The Membership Advantage
Our students are in high demand - become a member to ensure placement.
The first step is to discuss possible projects from your organization and how they might fit with the goals of the MSDS program. Our Director of Partnerships can provide additional information and answer any questions:
Director of Partnerships
The MS in Data Science (MSDS) sends internship teams to companies tackling data science problems in industries ranging from high-frequency trading to hospitality and energy efficiency to transportation.
Our partner organizations range from small start-ups to established Bay Area technology firms as well as large civic and nonprofit organizations.
Our Team: Ben Miroglio and Chhavi Choudhry
Goal: Cluster web sessions to segment users and improve the flow of Airbnb's website and mobile app.
Ben and Chhavi used R and Python to apply machine learning techniques that identify features indicative of positive outcomes. They built an interactive web session visualizer in D3.js to identify key differences among user segments and to find bottlenecks in the session journey.
Our Team: Vincent Pham and Brynne Lycette
Goal: Employ machine learning techniques for credit card fraud detection and build a data unification platform.
Capital One's fraud team has collected and built more than two hundred features relevant to classifying fraudulent credit card transactions. Vincent and Bree applied various machine learning techniques using H2O and Dato to evaluate software robustness and increase the accuracy of fraud prediction. They also implemented a NoSQL data store and a higher-level in-memory storage system to unify various streaming and batch processes.
Our Team: Meg Ellis and Jack Norman
Goal: Create a price-suggestion model to assist event organizers in optimizing ticket sales and revenue.
After identifying the features that most influence ticket prices, Meg and Jack implemented a K Nearest Neighbors model that groups events with similar characteristics. They then used the distribution of costs of these similar, successful events to suggest an appropriate range of ticket prices that organizers can use when creating an event. A Flask web application lets users interact with the model.
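The nearest-neighbor idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's model: the feature tuples (capacity, duration in hours), the prices, and the function name suggest_price_range are all made up for the example, and the suggested range is simply the minimum and maximum price among the neighbors.

```python
import math

def suggest_price_range(new_event, past_events, k=3):
    """Return (low, high) ticket prices taken from the k most similar past events.

    new_event:   tuple of numeric features (hypothetical: capacity, duration in hours)
    past_events: list of (features, ticket_price) pairs
    """
    # Rank past events by Euclidean distance to the new event's features.
    # In practice the features would be scaled first so no single one dominates.
    ranked = sorted(past_events, key=lambda e: math.dist(e[0], new_event))
    neighbor_prices = sorted(price for _, price in ranked[:k])
    # Use the spread of the neighbors' prices as the suggested range.
    return neighbor_prices[0], neighbor_prices[-1]

# Made-up past events: ((capacity, duration_hours), ticket_price)
past = [
    ((100, 2.0), 20.0),
    ((120, 2.5), 25.0),
    ((110, 2.0), 22.0),
    ((5000, 8.0), 150.0),
    ((4500, 6.0), 120.0),
]
low, high = suggest_price_range((105, 2.0), past, k=3)
print(low, high)
```

A percentile band over the neighbors' prices would be a natural alternative to the min/max range when more neighbors are used.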
Our Team: Sandeep Vanga
Goal: Perform unsupervised text clustering to gain insights into representative sub-topics.
Sandeep built a baseline model using k-means clustering and TF-IDF features. He also devised two variants of Word2Vec (deep learning-based features) models. The first method is based on aggregation of word vectors and the second is based on Bag of Clusters (BoClu) of words. He also implemented the elbow method to choose the optimal number of clusters. These algorithms were validated on 10 different brands/topics using news data collected over one year. Various quantitative metrics, such as entropy and silhouette score, and visualization techniques were used to validate the algorithms.
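As a rough illustration of the elbow method mentioned above, the sketch below runs a small k-means for several values of k and picks the k after which the within-cluster sum of squares stops dropping sharply. It makes several assumptions that are not from the project: it clusters made-up 2-D points rather than TF-IDF or Word2Vec features, it uses a deterministic farthest-point initialization, and it defines the "elbow" as the largest second difference of the inertia curve, which is only one of several common heuristics.

```python
import math

def kmeans_inertia(points, k, iters=20):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def elbow_k(points, max_k):
    """Pick the k where the drop in inertia slows the most (largest second difference)."""
    inertia = {k: kmeans_inertia(points, k) for k in range(1, max_k + 1)}
    return max(
        range(2, max_k),
        key=lambda k: (inertia[k - 1] - inertia[k]) - (inertia[k] - inertia[k + 1]),
    )

# Three made-up, well-separated groups of 2-D points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
best_k = elbow_k(points, max_k=5)
print(best_k)
```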
Our Team: David Reilly
Goal: Examine over 300,000 trips in the city of San Francisco to study driver behavior using SQL and R.
David constructed behavioral and situational features in order to model driver responses to dispatch requests using advanced machine learning algorithms. He analyzed cancellation fee refund rates across multiple cities in order to predict when a cancellation fee should be applied using Python.
Our Team: Sandeep Vanga and Rachan Bassi
Goal: Automate the process of image tagging by employing image processing as well as machine learning tools.
Williams-Sonoma’s product feed contains more than a million images, and the corresponding metadata, such as color, pattern, and type of image (catalog/multiproduct/single-product), is extremely important for optimizing search and product recommendations. Sandeep and Rachan automated the process of image tagging by employing image processing as well as machine learning tools. They used image saliency and color histogram-based computer vision techniques to segment and identify important regions/features of an image. A decision tree-based machine learning algorithm was used to classify the images. They achieved 90% accuracy on silhouette/single-product images and 70% accuracy on complex multiproduct/catalog images.
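To give a sense of what a color histogram feature looks like, here is a small sketch under stated assumptions: the image is represented as a plain list of (r, g, b) tuples with 8-bit channels, the pixel values are made up, and the function names are hypothetical. It is not the pipeline used in the project, which also relied on image saliency and a decision tree classifier.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram over quantized (r, g, b) bins.

    Assumes 8-bit channels and that bins_per_channel divides 256 evenly.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}

def dominant_bin(pixels, bins_per_channel=4):
    """Return the quantized color bin that covers the most pixels."""
    hist = color_histogram(pixels, bins_per_channel)
    return max(hist, key=hist.get)

# Made-up image: mostly red, with some blue and white pixels.
pixels = [(250, 10, 10)] * 6 + [(10, 10, 250)] * 2 + [(255, 255, 255)] * 2
hist = color_histogram(pixels)
print(dominant_bin(pixels), hist)
```

A histogram like this could feed a classifier as a fixed-length feature vector, or the dominant bin could be mapped to a color tag.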
Nonprofit and Civic Organizations
Our Team: Michaela Hull
Goal: Find duplicate voters using exact and fuzzy matching, feature engineering such as distances between two points of interest, trawling the Census Bureau website for potentially useful demographic features, and classification models, all in the name of poll worker prediction.
Michaela used distributed computing, the Google Maps API, relational databases, large datasets (~5 million observations), and a variety of machine learning techniques.
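The combination of exact and fuzzy matching for duplicate detection can be sketched with Python's standard library. This is an assumption-laden illustration only: the records, the (name, birth year) layout, the 0.85 threshold, and the function names are invented for the example, and a pairwise comparison like this would not scale to millions of records without blocking or distributed processing.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(record):
    """Lowercase the name and collapse punctuation/whitespace so exact matches line up."""
    name, birth_year = record
    return " ".join(name.lower().replace(",", " ").split()), birth_year

def likely_duplicates(records, threshold=0.85):
    """Return index pairs whose names match exactly or are near-identical."""
    cleaned = [normalize(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(cleaned)), 2):
        (name_a, year_a), (name_b, year_b) = cleaned[i], cleaned[j]
        if year_a != year_b:
            continue  # Cheap filter: different birth years are treated as different people.
        exact = name_a == name_b
        fuzzy = SequenceMatcher(None, name_a, name_b).ratio() >= threshold
        if exact or fuzzy:
            pairs.append((i, j))
    return pairs

# Made-up records: (name, birth_year)
records = [
    ("Maria Gonzalez", 1980),
    ("maria  gonzalez", 1980),   # same after normalization
    ("Maria Gonzales", 1980),    # one-letter spelling difference
    ("John Smith", 1975),
    ("Maria Gonzalez", 1962),    # same name, different birth year
]
duplicates = likely_duplicates(records)
print(duplicates)
```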
Our Team: Griffin Okamoto and Scott Kellert
Goal: Demonstrate the efficacy of Summit's online content assessments by using student scores on these assessments and demographic information to predict their standardized test scores.
Griffin and Scott developed a linear regression model using R and ggplot2 and presented results and recommendations for Summit's teaching model to the Information Team.
Our Team: Alex Romriell and Swetha Venkata Reddy
Goal: Create a model to detect conjunctivitis outbreaks.
Alex and Swetha performed text and geospatial analysis on more than 300,000 tweets to detect local pinkeye outbreaks. They created a framework for identifying tweets directly related to conjunctivitis. Surges in outbreaks were mapped to clinical records nationwide. Time series analysis of the tweets revealed trends and seasonality similar to those in the actual hospital data. Text analysis techniques such as latent semantic analysis, run on AWS, were employed to filter noise from the data. A multinomial Naive Bayes model based on the TF-IDF scores of the tweets was also developed to predict sentiment.
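For readers unfamiliar with multinomial Naive Bayes, the following sketch shows the basic mechanics on a handful of made-up tweets. It departs from the project in ways worth stating plainly: it uses raw token counts rather than TF-IDF scores, the labels ("related" and "unrelated") and all example texts are hypothetical, and the function names are chosen only for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns label counts, per-label token counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for text, label in docs:
        token_counts[label].update(text.lower().split())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return label_counts, token_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior for the given text."""
    label_counts, token_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in text.lower().split():
            if tok in vocab:  # Ignore tokens never seen in training.
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((token_counts[label][tok] + 1) / (total_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training tweets.
training = [
    ("my eyes are red and itchy", "related"),
    ("pink eye is going around school", "related"),
    ("woke up with pink eye again", "related"),
    ("pink sunset tonight", "unrelated"),
    ("eye of the tiger on repeat", "unrelated"),
    ("new red shoes", "unrelated"),
]
model = train_nb(training)
print(predict_nb(model, "kids have pink eye"))
print(predict_nb(model, "red shoes on sale"))
```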