How do I evaluate the impact of a word on the overall joke rating? I have a table of jokes with ratings. I would like to analyze which words are most common in the best jokes. How do I do that in Python or R?
This is really a text analysis problem. I have done some similar types of text analysis in R, though on survey data rather than jokes. So, here is how I have gone about it in my work:
· Get the book Text Mining with R: A Tidy Approach (available on Amazon, and a free copy is, or at least was, available on the web). It will be a guide to the entire process. It takes a while to thoroughly understand it, but it is worth the effort.
· Read in your data, with the corpus being “jokes”. A corpus is just a collection of documents or other text at the appropriate unit of analysis; in your case, that unit is the individual joke. (An R sketch of this step and the next few appears after this list.)
· Tokenize your corpus, at the word level, since you are interested in word frequency. “Tokenize” means breaking each joke into its component words, with an identifier to tie each word in a joke to that joke.
· Now you have to eliminate “stop words”. Those are words like “a”, “the”, and “to”, whose function is more grammatical than substantive. It won’t be helpful to find out that the most common words in the funniest jokes are “the” and “a”. There are lists of stop words that you can use, and you can add your own words to those lists if there are special considerations about the usefulness of some words (which there well may be with a corpus of jokes).
· You might also tokenize at the “phrase” level (a sequence of words, known as an n-gram), as that might be useful for your application (e.g. “take”, “my”, “wife”, and “please” aren’t intrinsically amusing, but the phrase “take my wife, please” has historically been considered a funny line). There is an n-gram sketch after this list.
· Once you have tokenized, you can run word counts or word clouds, subsetting your data by how funny the joke was rated. If there are words that are more commonly attached to particularly funny jokes, they should show up in the word count lists or word clouds.
· You may want to be careful about the length of the jokes, which could make some words seem funnier than others simply because the jokes in which they are embedded are longer (so the word gets repeated more often). You might want to normalize for that in some way (e.g. what proportion of all the words in a joke does that word represent).
· While you are at it, you might also try sentiment analysis, which links words to their sentiment value. Besides looking at word frequency, you might want to evaluate the overall sentiment of the jokes against their rated humor value. There are publicly available dictionaries (lexicons) for scoring the sentiments that tend to be linked to words (e.g. whether a word is normally considered “happy”, “sad”, “scary”, etc.). For example, you might discover that jokes with an overall “happy” sentiment are rated funnier than those with an overall “scary” sentiment, or the reverse; either would be an interesting thing to find. A sentiment sketch follows the list below.
· Then there is topic modelling, which is pretty tricky to interpret, but it attempts to discover the underlying themes in your corpus. It is a bit like cluster analysis or factor analysis in statistics, which require a fairly high degree of subjective interpretation by the analyst. It might be useful or it might not; you won’t know until you try. A brief topic-modelling sketch is included at the end of this post.
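Here is a minimal sketch of the core pipeline in R, using the tidytext and dplyr packages from the book above. The file name, the columns joke_id, text, and rating, and the rating cutoff are all hypothetical; adjust them to your actual table.

library(readr)
library(dplyr)
library(tidytext)

# Read the table of jokes: one row per joke, with an id, the text, and a rating.
jokes <- read_csv("jokes.csv")

# Tokenize at the word level; unnest_tokens() repeats joke_id and rating on every
# word row, which is the identifier tying each word back to its joke.
tidy_jokes <- jokes %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")   # drop "the", "a", "to", etc.

# Compare word frequencies between the funnier and less funny jokes
# (a cutoff of 4 on an assumed 1-5 rating scale is just an illustration).
word_counts <- tidy_jokes %>%
  mutate(group = if_else(rating >= 4, "funny", "less funny")) %>%
  count(group, word, sort = TRUE)

# Normalize for joke length: the share of a joke's words that each word makes up.
word_props <- tidy_jokes %>%
  add_count(joke_id, name = "joke_length") %>%
  count(joke_id, rating, word, joke_length) %>%
  mutate(prop = n / joke_length)

# A quick word cloud of the funnier group (needs the wordcloud package).
library(wordcloud)
with(filter(word_counts, group == "funny"),
     wordcloud(word, n, max.words = 50))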
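For the phrase-level idea, tidytext tokenizes into n-grams. This sketch counts word pairs (bigrams); you would raise n for longer phrases:

# Tokenize into two-word phrases instead of single words; use n = 4 to catch
# something like "take my wife please".
bigram_counts <- jokes %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # jokes shorter than n words produce NA rows
  count(bigram, sort = TRUE)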
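For sentiment analysis, tidytext exposes public sentiment lexicons through get_sentiments(); "bing" labels words positive or negative, while "nrc" adds emotions such as joy and fear (some lexicons prompt for a download on first use). A rough sketch, again assuming the hypothetical joke_id and rating columns above:

library(tidyr)

# Score each joke by its balance of positive and negative words.
joke_sentiment <- tidy_jokes %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(joke_id, rating, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

# A simple first look: does net sentiment move with the humor rating?
cor(joke_sentiment$net_sentiment, joke_sentiment$rating)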
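Finally, a topic-modelling sketch with LDA from the topicmodels package. The number of topics (k) is a guess you have to experiment with, and the topics themselves need your own interpretation:

library(topicmodels)

# Cast the tidy word counts into a document-term matrix, then fit LDA.
joke_dtm <- tidy_jokes %>%
  count(joke_id, word) %>%
  cast_dtm(joke_id, word, n)

joke_lda <- LDA(joke_dtm, k = 6, control = list(seed = 1234))  # k = 6 is arbitrary

# The top words in each topic; beta is the per-topic word probability.
top_terms <- tidy(joke_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()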