Using NLP with Classification Models

Problem Statement:

Meredith's National Media Group reaches more than 180 million unduplicated American consumers every month, including over 80 percent of U.S. millennial women. Meredith is the No. 1 magazine operator in the U.S., and owner of the largest premium content digital network for American consumers.

They are interested in marketing to parents, especially given that Parents Magazine is one of their most popular publications. The belief is that differentiating between people who are pregnant and those beyond pregnancy will help the marketing team develop more targeted campaigns. My work aims to distinguish between these two groups, and to describe what people in each group post about, in order to inform the marketing team at Meredith Corporation.

I will develop multiple classification models, including random forest, k-nearest neighbors, and logistic regression, and I will also try ensembling to improve performance. Success will be evaluated using cross-validation, with the goal of minimizing both misclassifications and overfitting (variance).

The Data:

Using Reddit's built-in API, I scraped the 4,000 most recent posts, split equally between two subreddits: r/pregnant and r/beyondthebump.
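For readers who want to try something similar, here is a minimal sketch of paging through a subreddit's public JSON listing. It assumes the `/new.json` endpoint with its `limit` and `after` parameters; the function names, the `USER_AGENT` string, and the two-second pause are hypothetical choices for illustration, not the code used for this project.

```python
import json
import time
import urllib.parse
import urllib.request

# Hypothetical identifier; Reddit expects a descriptive User-Agent header.
USER_AGENT = "subreddit-classifier-sketch/0.1"


def build_url(subreddit, after=None, limit=100):
    """Build the URL for one page of a subreddit's newest posts."""
    params = {"limit": limit}
    if after:
        params["after"] = after  # pagination token from the previous page
    query = urllib.parse.urlencode(params)
    return f"https://www.reddit.com/r/{subreddit}/new.json?{query}"


def parse_listing(payload, label):
    """Turn one listing response into (rows, next_after_token)."""
    rows = [
        {
            "subreddit": label,
            "title": child["data"].get("title", ""),
            "selftext": child["data"].get("selftext", ""),
        }
        for child in payload["data"]["children"]
    ]
    return rows, payload["data"].get("after")


def fetch_posts(subreddit, n_posts=2000):
    """Page through the listing until n_posts are collected or it runs out."""
    posts, after = [], None
    while len(posts) < n_posts:
        request = urllib.request.Request(
            build_url(subreddit, after), headers={"User-Agent": USER_AGENT}
        )
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        rows, after = parse_listing(payload, subreddit)
        posts.extend(rows)
        if not rows or not after:
            break  # no more pages available
        time.sleep(2)  # assumed polite delay between requests
    return posts[:n_posts]
```

Calling `fetch_posts("pregnant")` and `fetch_posts("beyondthebump")` would then give two labeled lists that can be combined into one dataset.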

The Process:

As with any data project, there was much to explore, and many decisions about how to proceed were made along the way. Especially as someone newer to the field, I found this experience of trial and learning (no errors, just observing what happens and making decisions from that information) incredibly helpful.

With this data, I was able to explore several components of natural language processing: CountVectorizer, TfidfVectorizer, a tokenizer, PorterStemmer, the spaCy lemmatizer, custom stop words, and more.
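To make a few of these components concrete, here is a minimal sketch that combines a simple tokenizer, NLTK's `PorterStemmer`, and scikit-learn's `TfidfVectorizer`. The regex tokenizer, the `stem_tokenize` helper, and the two toy posts are assumptions made for illustration; they are not the project's actual code or data.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()


def stem_tokenize(text):
    """Lowercase, split into alphabetic tokens, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]


# Toy posts invented for this example; not taken from either subreddit.
posts = [
    "Babies sleeping through the night",
    "Feeling kicks at twenty weeks",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=stem_tokenize, token_pattern=None)
tfidf = vectorizer.fit_transform(posts)

print(tfidf.shape)  # (number of posts, number of distinct stems)
print(sorted(vectorizer.vocabulary_))
```

Stemming collapses variants such as "sleeping" and "sleep" into one feature, which shrinks the vocabulary the models have to learn from.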

I examined the top words in each subreddit, along with frequently occurring bigrams. I compared word counts across the two categories (as well as the frequent words they share) and found a significant difference between the two categories in the use of question marks.
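This kind of exploration can be sketched with nothing but the standard library. The helper names and the toy posts below are hypothetical, the tokenization is deliberately crude, and comparing averages is only a first look, not a significance test.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_bigrams(posts, n=3):
    """Most common pairs of adjacent words across a list of posts."""
    counts = Counter()
    for post in posts:
        words = tokenize(post)
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)


def mean_question_marks(posts):
    """Average number of question marks per post."""
    return sum(post.count("?") for post in posts) / len(posts)


# Toy posts invented for this example; not real subreddit data.
group_a = ["Is this normal? My due date moved.", "Due date is next week!"]
group_b = ["Sleep regression again.", "Any tips for sleep regression?"]

for name, posts in [("group_a", group_a), ("group_b", group_b)]:
    print(name, top_bigrams(posts, n=2), mean_question_marks(posts))
```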

In addition, I explored multiple models and the hyperparameters of each. I fit and cross-validated k-nearest neighbors, logistic regression, random forest, and a VotingClassifier ensemble. For each, I used GridSearchCV to tune hyperparameters.
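Here is a minimal sketch of how such a setup might look in scikit-learn: a pipeline that feeds TF-IDF features into a `VotingClassifier`, tuned with `GridSearchCV`. The toy corpus, the parameter grid, and the specific settings (three folds, 50 trees, three neighbors) are assumptions for illustration, not the values used in this project.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus invented for this example: 0 = pregnancy, 1 = beyond pregnancy.
texts = [
    "first ultrasound at twelve weeks tomorrow",
    "morning sickness in the second trimester",
    "feeling the baby kick at twenty weeks",
    "what to pack in my hospital bag",
    "glucose test results and third trimester nerves",
    "is my due date going to change",
    "newborn will not sleep in the bassinet",
    "breastfeeding and pumping schedule help",
    "postpartum recovery six weeks after birth",
    "four month sleep regression is exhausting",
    "when did your baby start solid foods",
    "returning to work after maternity leave",
]
labels = [0] * 6 + [1] * 6

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ]
)
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("vote", ensemble)])

# Double underscores reach into pipeline steps and into the ensemble's members.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "vote__lr__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the held-out fold never leaks into the vocabulary.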

Ultimately, tokenizing, stemming, and vectorizing with TfidfVectorizer were the best preprocessing steps for all of my models. I chose the RandomForestClassifier as the best model because of its low variance and narrow confidence interval (80% accuracy +/- 2%). Although the VotingClassifier performed just as well, it is harder to interpret because it is a black-box method, so RandomForest was the sensible choice between the two.
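A summary like "accuracy +/- spread" and a check for overfitting can both come out of one cross-validation run. The sketch below assumes the spread is reported as two standard deviations of the fold scores, which is one common convention and not necessarily how the figure above was computed; the toy corpus is invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Toy corpus invented for this example: 0 = pregnancy, 1 = beyond pregnancy.
texts = [
    "first ultrasound at twelve weeks tomorrow",
    "morning sickness in the second trimester",
    "feeling the baby kick at twenty weeks",
    "what to pack in my hospital bag",
    "glucose test results and third trimester nerves",
    "is my due date going to change",
    "newborn will not sleep in the bassinet",
    "breastfeeding and pumping schedule help",
    "postpartum recovery six weeks after birth",
    "four month sleep regression is exhausting",
    "when did your baby start solid foods",
    "returning to work after maternity leave",
]
labels = [0] * 6 + [1] * 6

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
results = cross_validate(model, texts, labels, cv=3, return_train_score=True)

test_mean = results["test_score"].mean()
test_spread = 2 * results["test_score"].std()  # assumed convention: 2 std devs
train_mean = results["train_score"].mean()

print(f"CV accuracy: {test_mean:.2f} +/- {test_spread:.2f}")
# A large gap between training and validation accuracy suggests overfitting.
print(f"Train/validation gap: {train_mean - test_mean:.2f}")
```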

What would come next?

The next phase of the project would be exploratory data analysis of the posts that the model misclassified, which may reveal trends that could improve the model. After that, it would be important to analyze the posts for mentions of specific products and services and link them to a timeline, in order to improve Meredith's marketing plan. Understanding the phase at which parents want information about those things can help the marketing team create a more personalized campaign.
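One possible starting point for that error analysis is to collect out-of-fold predictions and pull out the posts the model got wrong. This is a sketch under assumptions: the toy corpus is invented, and `cross_val_predict` with three folds is just one way to get a prediction for every post without scoring a post the model was trained on.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy corpus invented for this example: 0 = pregnancy, 1 = beyond pregnancy.
texts = [
    "first ultrasound at twelve weeks tomorrow",
    "morning sickness in the second trimester",
    "feeling the baby kick at twenty weeks",
    "what to pack in my hospital bag",
    "glucose test results and third trimester nerves",
    "is my due date going to change",
    "newborn will not sleep in the bassinet",
    "breastfeeding and pumping schedule help",
    "postpartum recovery six weeks after birth",
    "four month sleep regression is exhausting",
    "when did your baby start solid foods",
    "returning to work after maternity leave",
]
labels = [0] * 6 + [1] * 6

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Each post is predicted by a model that never saw it during training.
predicted = cross_val_predict(model, texts, labels, cv=3)

misclassified = [
    (text, actual, guess)
    for text, actual, guess in zip(texts, labels, predicted)
    if actual != guess
]
for text, actual, guess in misclassified:
    print(f"actual={actual} predicted={guess}: {text}")
```

Reading through the misclassified posts by hand is often the quickest way to spot shared vocabulary or topics that blur the two groups.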
