Problem Statement:
Meredith's National Media Group reaches more than 180 million unduplicated American consumers every month, including over 80 percent of U.S. millennial women. Meredith is the No. 1 magazine operator in the U.S., and owner of the largest premium content digital network for American consumers.
Meredith is interested in marketing to parents, especially given that Parents magazine is one of its most popular publications. The working belief is that differentiating between pregnant people and those beyond pregnancy will help the marketing team develop more targeted campaigns. My work aims to find a way to distinguish between these two groups, and the kinds of things each group posts about, in order to inform the marketing team at Meredith Corporation.
I will develop multiple classification models, including random forest, k-nearest neighbors, and logistic regression, and I will also try ensembling to see whether it improves performance. Success will be evaluated using cross-validation, with the goal of minimizing both false predictions and overfitting (variance).
The Data:
Using Reddit's API, I scraped a total of 4,000 posts (the most recent available), split evenly between two subreddits: r/pregnant and r/beyondthebump.
The Process:
As with any data project, there were many things to explore and many decisions to make along the way. Especially as someone newer to the field, I found this experience of trial and learning (no errors, just observing what happens and making decisions from that information) incredibly helpful.
With this data, I was able to explore several different components of Natural Language Processing: CountVectorizer, TfidfVectorizer, tokenization, PorterStemmer, spaCy's lemmatizer, custom stop words, and more.
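To make the tokenize-then-stem step concrete, here is a minimal sketch, assuming NLTK's `PorterStemmer` and `RegexpTokenizer` plugged into scikit-learn's `TfidfVectorizer`. The helper name `stem_tokens`, the letters-only token pattern, and the sample documents are illustrative assumptions, not the exact setup used in the project.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: keep runs of letters only; the project's tokenizer may differ.
_tokenizer = RegexpTokenizer(r"[a-z]+")
_stemmer = PorterStemmer()

def stem_tokens(text):
    """Lowercase the text, split it into word tokens, and stem each token."""
    return [_stemmer.stem(tok) for tok in _tokenizer.tokenize(text.lower())]

# token_pattern=None tells the vectorizer that the custom tokenizer replaces it.
vectorizer = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None)

# Hypothetical sample documents, just to show the resulting vocabulary.
docs = ["Babies sleeping", "baby sleeps"]
matrix = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
```

Because the stemmer maps "babies" and "baby" to the same stem, both documents end up sharing the same features. Any custom stop words passed to the vectorizer would need to be written in their stemmed form to match.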
I examined the top words in each subreddit, along with frequently occurring bigrams. I compared the word counts of posts in each category (as well as the frequent words the two categories share) and found a significant difference between the categories in their use of question marks.
On top of that exploration, I also tried multiple models and hyperparameter settings within each model. I fit and cross-validated k-nearest neighbors, logistic regression, random forest, and a VotingClassifier ensemble. For each of those, I used grid search to tune the hyperparameters.
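A minimal sketch of how the three models, the voting ensemble, and a grid search could fit together, assuming scikit-learn. The toy corpus, the step names, and the parameter grid are illustrative assumptions; the grids actually searched in the project are not given here.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = r/pregnant, 1 = r/beyondthebump.
texts = [
    "first trimester nausea tips", "third trimester back pain",
    "ultrasound appointment next week", "morning sickness all day",
    "feeling kicks at twenty weeks", "hospital bag before due date",
    "sleep regression at four months", "baby slept through the night",
    "starting solids at six months", "newborn will not nap",
    "postpartum recovery is slow", "teething baby keeps waking",
]
labels = [0] * 6 + [1] * 6

# Soft voting averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",
)
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", ensemble)])

# Illustrative grid only: nested names follow step__estimator__parameter.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__logreg__C": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 2))
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from held-out posts.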
Ultimately, tokenizing, stemming, and using TfidfVectorizer were the best preprocessing steps for all of my models. The RandomForestClassifier was chosen as the best model due to its low variance and narrow confidence interval (80% accuracy +/- 2%). Although the VotingClassifier performed equally well, it is harder to interpret because it behaves as a black box, so choosing the random forest made sense between the two.
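One common way to report an "accuracy +/- spread" figure like the one above is the mean of the cross-validated scores plus or minus two standard deviations. The sketch below assumes that convention and scikit-learn's `cross_val_score`; how the interval was actually computed in the project is not stated, and the toy corpus and the helper name `accuracy_interval` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def accuracy_interval(model, texts, labels, cv=5):
    """Return (mean accuracy, 2 * standard deviation) across CV folds."""
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean(), 2 * scores.std()

# Hypothetical toy corpus: 0 = r/pregnant, 1 = r/beyondthebump.
texts = [
    "first trimester nausea tips", "third trimester back pain",
    "ultrasound appointment next week", "morning sickness all day",
    "feeling kicks at twenty weeks", "hospital bag before due date",
    "sleep regression at four months", "baby slept through the night",
    "starting solids at six months", "newborn will not nap",
    "postpartum recovery is slow", "teething baby keeps waking",
]
labels = [0] * 6 + [1] * 6

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
mean, spread = accuracy_interval(model, texts, labels, cv=3)
print(f"accuracy: {mean:.2f} +/- {spread:.2f}")
```

A small spread across folds suggests the model's performance does not depend heavily on which posts land in the training set, which is the low-variance property used to pick the final model.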
What would come next?
The next phase of the project would be to perform exploratory data analysis on the posts that the model misclassified. It may be possible to identify trends in those posts and use them to improve the model. Following that work, it would be important to analyze the posts for mentions of specific products and services and link them to a timeline in order to improve the marketing plan Meredith has in mind. Understanding at which phase parents want information about those things can help the marketing team create a more personalized campaign.
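As a starting point for that error analysis, misclassified posts could be collected from out-of-fold predictions. This is a minimal sketch, assuming scikit-learn's `cross_val_predict`; the toy corpus and the variable names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = r/pregnant, 1 = r/beyondthebump.
texts = [
    "first trimester nausea tips", "third trimester back pain",
    "ultrasound appointment next week", "morning sickness all day",
    "feeling kicks at twenty weeks", "hospital bag before due date",
    "sleep regression at four months", "baby slept through the night",
    "starting solids at six months", "newborn will not nap",
    "postpartum recovery is slow", "teething baby keeps waking",
]
labels = [0] * 6 + [1] * 6

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Each post is predicted by a model that never saw it during training.
preds = cross_val_predict(model, texts, labels, cv=3)

# Keep (text, true label, predicted label) for every wrong prediction.
misclassified = [
    (text, true, int(pred))
    for text, true, pred in zip(texts, labels, preds)
    if true != pred
]
for text, true, pred in misclassified:
    print(f"true={true} predicted={pred}: {text}")
```

Reading through this list by hand is often the quickest way to spot patterns, such as posts about topics that both communities discuss.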