Using NLP with Classification Models

Problem Statement:

Meredith's National Media Group reaches more than 180 million unduplicated American consumers every month, including over 80 percent of U.S. millennial women. Meredith is the No. 1 magazine operator in the U.S., and owner of the largest premium content digital network for American consumers.

They are interested in marketing to parents, especially given that Parents Magazine is one of their most popular publications. Differentiating between pregnant people and those beyond pregnancy should help the marketing team develop more targeted campaigns. My work aims to distinguish between these two groups, and the kinds of things people in each group post about, to inform the marketing team at Meredith Corporation.

I will develop multiple classification models, including RandomForest, KNearestNeighbors, and LogisticRegression, and I will also try ensembling to see whether it improves performance. Success will be evaluated using cross-validation, with the goal of minimizing both false predictions and overfitting (variance).

The Data:

Using Reddit's API, I scraped the 4,000 most recent posts, split equally between two subreddits: r/pregnant and r/beyondthebump.

The Process:

As with any data project, there were many things to explore and many decisions to make along the way. Especially as someone newer to the field, I found this experience of trial and learn (no errors, just observing what happens and making decisions from that information) incredibly helpful.

With this data, I was able to explore several different components of Natural Language Processing: CountVectorizer, TfidfVectorizer, tokenizing, PorterStemmer, spaCy's lemmatizer, custom stop words, and more.

I examined the top words in each subreddit, along with frequently occurring bigrams. I looked at differences in word count between the two categories (as well as frequent words that overlapped) and also found a significant difference between the two categories in their use of question marks.
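The kind of exploration described above can be sketched with the standard library alone. This is a minimal illustration, not the code used in the project: the toy posts, the regex tokenizer, and the helper names `top_ngrams` and `question_mark_rate` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy posts; the real project used scraped subreddit posts.
posts = {
    "pregnant": [
        "Is it normal to feel this tired in the first trimester?",
        "First trimester nausea tips? Anything helps!",
    ],
    "beyondthebump": [
        "Baby finally slept through the night.",
        "Sleep regression at four months is hard.",
    ],
}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_ngrams(texts, n=1, k=3):
    """Return the k most common n-grams (as tuples) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list to form n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


def question_mark_rate(texts):
    """Fraction of texts that contain at least one question mark."""
    return sum("?" in text for text in texts) / len(texts)


for name, texts in posts.items():
    print(name, top_ngrams(texts, n=2, k=1), question_mark_rate(texts))
```

Comparing `question_mark_rate` across the two groups is only a descriptive check; deciding whether a gap is statistically significant would need a proper test on the full data.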

Beyond the preprocessing, I also explored multiple models and the parameters within each one. I fit and cross-validated KNearestNeighbors, LogisticRegression, RandomForest, and a VotingClassifier ensemble. For each of these, I used GridSearch to tune the hyperparameters.
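As a rough sketch of how these pieces can fit together in scikit-learn, the example below wraps a TfidfVectorizer and a soft-voting ensemble in a Pipeline and tunes two hyperparameters with GridSearchCV. The toy corpus, the parameter grid, and the estimator settings are assumptions chosen to keep the example small; they are not the values from the project, and stemming is left out for brevity.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = r/pregnant, 0 = r/beyondthebump.
texts = [
    "first trimester nausea is rough",
    "anatomy scan next week at twenty weeks",
    "third trimester and cannot sleep",
    "is this cramping normal at ten weeks",
    "packing my hospital bag before the due date",
    "glucose test tomorrow any tips",
    "baby will not nap in the crib",
    "four month sleep regression is here",
    "starting solids with my six month old",
    "postpartum recovery is harder than expected",
    "newborn cluster feeding all night",
    "returning to work after maternity leave",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Soft voting averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",
)

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", ensemble)])

# Double underscores reach into nested steps: pipeline step, then estimator.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__forest__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Keeping the vectorizer inside the Pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out posts.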

Ultimately, tokenizing, stemming, and using TfidfVectorizer were the best preprocessing steps for all of my models. I chose the RandomForestClassifier as the best model because of its low variance and narrow confidence interval (80% accuracy +/- 2%). Although the VotingClassifier performed just as well, it is harder to interpret because it is a black-box method, so RandomForest was the better choice of the two.
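One common way to report a figure like "accuracy +/- some margin" is the mean of the cross-validation fold scores plus or minus two standard deviations. The post does not say how its interval was computed, so the sketch below is an assumption about the method, and the fold scores are made-up numbers for illustration only.

```python
from statistics import mean, stdev


def summarize_cv(scores):
    """Return (mean, half_width), where half_width is two sample std devs."""
    return mean(scores), 2 * stdev(scores)


# Hypothetical fold accuracies from a 5-fold cross-validation.
fold_scores = [0.78, 0.82, 0.80, 0.79, 0.81]

center, half_width = summarize_cv(fold_scores)
print(f"{center:.2f} accuracy +/- {half_width:.2f}")
```

A smaller half-width means the model scored similarly on every fold, which is one practical reading of "low variance" when comparing candidate models.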

What would come next?

The next phase of the project would be to perform exploratory data analysis on the posts that the model misclassified. Doing so may reveal trends that could be used to improve the model. Following that work, it would be important to analyze the posts for mentions of specific products and services and link them to a timeline in order to improve the marketing plan for Meredith. Understanding at which phase parents want information about those things can help the marketing team create a more personalized campaign.
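A first step in that direction could look like the sketch below, which pulls out the misclassified posts and counts them by true label. The example posts, labels, and predictions are hypothetical, as is the helper name `misclassified`; in practice the predictions would come from the fitted model on held-out data.

```python
from collections import Counter


def misclassified(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [
        (text, actual, predicted)
        for text, actual, predicted in zip(texts, y_true, y_pred)
        if actual != predicted
    ]


# Hypothetical held-out posts with true and predicted subreddit labels.
texts = [
    "so tired all the time",
    "anatomy scan went well",
    "baby is finally sleeping",
    "when did you buy a stroller",
]
y_true = ["pregnant", "pregnant", "beyondthebump", "pregnant"]
y_pred = ["beyondthebump", "pregnant", "beyondthebump", "beyondthebump"]

errors = misclassified(texts, y_true, y_pred)

# Count errors by true label to see which group the model struggles with.
errors_by_label = Counter(actual for _, actual, _ in errors)

for text, actual, predicted in errors:
    print(f"{actual} -> {predicted}: {text}")
print(errors_by_label)
```

Reading the misclassified posts side by side is often more informative than the counts alone, since topics like tiredness or baby gear plausibly appear in both subreddits.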
