
Metis Project 4

The fourth project at Metis focused on natural language processing and the topic modeling algorithms that make it work.

Problem Selection and Data

I have the questionable honor of having owned a 1987 Jeep Wrangler since 2008.

[photo: the 1987 Jeep Wrangler]

That makes JeepForum a website I am very familiar with, and it struck me as a potentially fascinating source of text to study.

Scraping

JeepForum has no robots.txt file at all. I wondered what would happen, but I was able to just scrape away with only the rare few “server didn’t respond” problems.

As blitzed as I was after Project 3 and my Air Force Research Lab project, and as intense a pace as the lectures were maintaining, and with trying to actually keep up with the challenge problem sets / homework for a while, it took me until the end of the week before Project 4 was due to get my web scraper code working. I still have “scrape, scrape, scrape the web” to the tune of Row, Row, Row Your Boat going in my head a week and a half later.

I wrote the code to start at the oldest (and most stable) page of the thread list for the YJ forum and spider its way forward in time across the Next Thread links, popping one thread’s worth of posts at a time, placing their text and metadata into dictionaries, and then pickling those lists of dictionaries once a thread pushed the post count above 100. I set up a baby free-tier AWS instance to scrape the TJ forum in tandem with my Ubuntu desktop scraping the YJ forum. By the time I stopped the scrapers, I had over 2,000,000 posts saved. I plan to finish scraping the two forums and use the complete dataset for an extension to this project in the coming months.
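The scraper itself isn’t reproduced here, but a minimal sketch of that crawl loop would look something like the following. The URL, CSS selectors, and field names are stand-ins (the real JeepForum markup differs); the shape of the loop is the point: follow each Next Thread link, collect post text and metadata into dictionaries, and pickle a batch once a thread pushes it past 100 posts.

```python
import pickle
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.jeepforum.com/forum/yj-oldest-thread"  # hypothetical starting thread
BATCH_SIZE = 100

def scrape_forum(start_url, out_prefix="yj_posts"):
    """Spider forward across Next Thread links, pickling posts in batches."""
    url, batch, batch_num = start_url, [], 0
    while url:
        resp = requests.get(url, timeout=30)
        if resp.status_code != 200:          # the rare "server didn't respond" case
            time.sleep(10)                    # wait and retry the same thread
            continue
        soup = BeautifulSoup(resp.text, "html.parser")

        # Selectors below are placeholders for the real forum markup.
        for post in soup.select("div.post"):
            batch.append({
                "thread_url": url,
                "author": post.select_one(".username").get_text(strip=True),
                "date": post.select_one(".post-date").get_text(strip=True),
                "text": post.select_one(".post-body").get_text(" ", strip=True),
            })

        # Once a thread pushes the batch past 100 posts, pickle it and start fresh.
        if len(batch) > BATCH_SIZE:
            with open(f"{out_prefix}_{batch_num:05d}.pkl", "wb") as f:
                pickle.dump(batch, f)
            batch, batch_num = [], batch_num + 1

        next_link = soup.select_one("a.next-thread")      # the "Next Thread" link
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1)                                      # be polite to the server
```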

Preprocessing

I fought spaCy quite a bit. All I wanted to do with the text was lemmatize it (i.e., convert inflected forms to a single common root word, and how did “lemma” get repurposed to mean this when I wasn’t looking?…) and drop an enhanced list of stopwords (“jeep” is not always a stopword, but in this dataset it is) before dumping the data as unigrams into an NLP algorithm. Unfortunately, I could not get spaCy to do anything this simple. There are probably NLTK tools I could have used, but I didn’t have time for a thorough exploration. In the end I added the stopwords to spaCy’s list, then exported them and had scikit-learn’s CountVectorizer (does he have fangs, and can you see him in mirrors?) drop the stopwords instead.
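For the curious, here’s a minimal sketch of that workaround: extend spaCy’s stopword list, lemmatize with spaCy, then export the combined list and let CountVectorizer do the dropping. The extra stopwords and sample posts are just examples, not the real lists.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # lemmas are all we need

# Add domain stopwords to spaCy's list (example words only).
extra_stopwords = {"jeep", "wrangler", "yj", "tj"}
nlp.Defaults.stop_words |= extra_stopwords

posts = [  # stand-ins for the pickled dictionaries from the scraper
    {"text": "Replaced the carburetor on the YJ and it idles rough when cold."},
    {"text": "Which tires are people running on a stock 1987 Wrangler?"},
]

def lemmatize(text):
    """Return the post as a space-joined string of lowercased lemmas."""
    return " ".join(tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha)

lemmatized_posts = [lemmatize(p["text"]) for p in posts]

# Export the stopword list and let CountVectorizer drop the stopwords instead.
vectorizer = CountVectorizer(stop_words=list(nlp.Defaults.stop_words))
doc_word = vectorizer.fit_transform(lemmatized_posts)   # document-by-word count matrix
words = vectorizer.get_feature_names_out()              # unigram vocabulary
```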

Processing

I tried every standard algorithm we were taught at Metis: SVD, NMF, LDA, and CorEx. Like a lot of my classmates, I wound up using CorEx after poking around with the output of all of these. There is a lot I don’t understand about the CorEx algorithm, and that makes me itchy, but a bootcamp schedule doesn’t really allow for that level of comprehension of most of the concepts. Another thing that made me itchy was that at the ~1,000-post level where I did a lot of this poking around, none of the algorithms did a great job of picking out topics.
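As a point of reference, the three scikit-learn models can be run on the CountVectorizer output from the preprocessing sketch roughly like this. The choice of 20 components just mirrors the topic count I ended up using; it isn’t anything principled, and SVD and NMF are often fed TF-IDF weights rather than raw counts.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD

n_topics = 20

models = {
    "SVD": TruncatedSVD(n_components=n_topics, random_state=42),
    "NMF": NMF(n_components=n_topics, random_state=42, max_iter=400),
    "LDA": LatentDirichletAllocation(n_components=n_topics, random_state=42),
}

for name, model in models.items():
    doc_topic = model.fit_transform(doc_word)   # doc_word, words from the CountVectorizer step
    # Each row of components_ scores every vocabulary word for one topic.
    for k, component in enumerate(model.components_):
        top_words = [words[i] for i in component.argsort()[-10:][::-1]]
        print(f"{name} topic {k}: {', '.join(top_words)}")
```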

I put up my pickles on a Google instance, burning some of that $300 credit for a quad-core-equivalent machine that probably outstripped my aging desktop by a considerable margin. I was able to store all the post data and metadata in MongoDB and then sample it. I elected to cut the ~1,000,000 posts I had uploaded to the instance down to about 50,000 posts to process, because I was worried about time (of course it was the afternoon before the project was due at this point), and chose them by picking every 20th day from November 2000 forward and grabbing the posts from each of those days.
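That every-20th-day sampling looks roughly like this with pymongo; the database, collection, date field, and end date are made up for the sketch.

```python
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient()                      # local MongoDB on the instance
post_coll = client["jeepforum"]["posts"]    # hypothetical database / collection names

sample = []
day = datetime(2000, 11, 1)                 # start in November 2000
while day < datetime(2019, 1, 1):           # end date is illustrative
    next_day = day + timedelta(days=1)
    sample.extend(post_coll.find({"date": {"$gte": day, "$lt": next_day}}))
    day += timedelta(days=20)               # jump ahead to every 20th day

print(f"Sampled {len(sample)} posts")
```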

Fortunately, on that long and time-balanced sample, and with my completely arbitrary choice of 20 topics, CorEx did a really good job of picking out recognizable word clusters, which you can see in my presentation slides.
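Fitting CorEx with that arbitrary 20 looks roughly like this using the corextopic package, with doc_word and words coming from the CountVectorizer step. The exact tuples returned by get_topics vary a little between package versions, but the first element is always the word.

```python
from corextopic import corextopic as ct

# CorEx takes a (sparse) document-word count matrix plus the matching vocabulary.
topic_model = ct.Corex(n_hidden=20, seed=42)
topic_model.fit(doc_word, words=list(words))

for k, topic in enumerate(topic_model.get_topics(n_words=10)):
    topic_words = [t[0] for t in topic]
    print(f"Topic {k} (TC={topic_model.tcs[k]:.2f}): {', '.join(topic_words)}")

print("Total correlation across all topics:", topic_model.tc)
```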

Presentation

I surprised myself by getting a Dash (Plotly’s extension of Flask) app up and running for the presentation. It shows the Total Correlation from CorEx for each topic (I will find out what that statistic is actually measuring in the months to come), as well as a hack I called Pseudocount for each topic: I summed the probabilities for every document (post) / topic combination down the columns and then normalized them to the total number of posts I’d processed. That was my best guess at how many posts were in each topic. Since posts really belong to more than one conceptual space, it makes sense to assign them fractionally to topics, and CorEx spits out numbers that, at least if you’re ignorant enough of the innards of the algorithm, could be taken as non-normalized topic “intensities”, if you will.
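In code, the Pseudocount hack is just a column sum over CorEx’s document-by-topic probability matrix (p_y_given_x), rescaled so the totals add up to the number of posts. Nothing about the algorithm promises these are counts, which is why it’s a hack.

```python
import numpy as np

doc_topic_probs = np.asarray(topic_model.p_y_given_x)    # shape: (n_posts, n_topics)
n_posts = doc_topic_probs.shape[0]

column_sums = doc_topic_probs.sum(axis=0)                 # one number per topic
pseudocounts = column_sums / column_sums.sum() * n_posts  # normalize to the number of posts

print(dict(enumerate(np.round(pseudocounts, 1))))
```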

Have I made it clear how much I want to understand this better? Of course, the very fact of throwing this together on an actual problem is what really drove me to want that.

The other side of the dashboard had the actual interactive component: a slider to pick topics and change the display to show a graph of word… scores?… for each of the top 20 words in the topic, as well as the post that scored highest. Let me tell you, lemmatized JeepForum posts are a scream. I should figure out how to display both the native text and the lemmatized form in my extended cut.
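Here’s a stripped-down sketch of that slider view, with top_words_by_topic and top_post_by_topic standing in for structures built from the CorEx output; the real app’s layout and names differ.

```python
import plotly.graph_objects as go
from dash import Dash, Input, Output, dcc, html

# Placeholder data; in the real app these come from the CorEx model output.
top_words_by_topic = {i: [(f"word{j}", 20 - j) for j in range(20)] for i in range(20)}
top_post_by_topic = {i: f"(lemmatized text of the highest-scoring post for topic {i})"
                     for i in range(20)}

app = Dash(__name__)

app.layout = html.Div([
    dcc.Slider(id="topic-slider", min=0, max=19, step=1, value=0),
    dcc.Graph(id="word-scores"),
    html.Pre(id="top-post"),
])

@app.callback(
    Output("word-scores", "figure"),
    Output("top-post", "children"),
    Input("topic-slider", "value"),
)
def show_topic(topic_idx):
    words, scores = zip(*top_words_by_topic[topic_idx])    # top 20 (word, score) pairs
    fig = go.Figure(go.Bar(x=list(words), y=list(scores)))
    fig.update_layout(title=f"Topic {topic_idx}: top word scores")
    return fig, top_post_by_topic[topic_idx]                # highest-scoring (lemmatized) post

if __name__ == "__main__":
    app.run(debug=True)
```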

The Future

This was too much fun. I really want to spend a ruinous amount of time playing with this data and building a bigger and better app / set of apps, especially a post recommender that will take your JeepForum profile name (or anyone’s!) and recommend posts to you based on your existing activity, and a topics-by-time-period breakdown. The first step after the bootcamp will be to figure out how to get the app safely onto Heroku or something so y’all can play with it.

–PAG
