Friday, 5 February 2016

Part 2 of a Review of “Marketing Analytics – A Practical Guide to Real Marketing Science” (by Mike Grigsby, Kogan Page)

A while back, I got a book from my Skillsoft learning library, with the above title.  As a statistician/analyst at a university, I was curious about how the statistical techniques that I use on a routine basis are applied in the marketing world.  And as someone who is involved in a small publishing venture, I was also curious about the theory and practice of marketing in general, and how it might be used to sell more novels :).  So, I thought I would read the book and do a write-up for the blog, to help fix ideas in my own mind and inform blog readers as well.

Naturally, if the book interests you, you should go to the source.  The Amazon link is given above.  The book sells for about 20 bucks, in both e-book and paperback form.  Though the content gets somewhat technical, given the subject matter, the writer maintains a very readable style in my opinion.

Since the book is fairly long, a proper look at it will take at least two blogs, maybe three.  I previously did a blog on Part One of the book, which was concerned with some elementary statistical ideas, as well as some fundamental concepts and strategies within the marketing world.  What follows is my synopsis of Part Two of the book, in point form.    This section deals with some fairly advanced statistical techniques, in the “predictive analytics” realm, namely:

·         Multiple regression,
·         Logistic regression,
·         Survival analysis,
·         And econometric modelling.

In some cases, I have inserted an example of a given technique, from my own book-related research, in italics.

Part Two – Dependent Variable Techniques

Chapter 3 – Modeling Dependent Variable Techniques - the things that drive demand

·         This chapter focuses on techniques that use equations that predict a dependent variable, based on the values of one or more independent variables.  These are generally known as regression techniques or general linear models.  The author gives a decent explanation of this, even mentioning a few subtle nuances, such as the use of dummy variables and price elasticity (there was a substantial technical section on this subject).  But this is a very brief review of a very large subject, so it is difficult to say whether a newcomer to the concepts would be able to really understand it well.

·         Here’s a quick example of a simple bivariate (two-variable) regression relationship, from some book-publishing data that I dug out of the Amazon website.  For a selection of 18 books on the “Also Bought” list of one of our books (Kati of Terra Book 1), I analyzed the book summaries with some text analysis software that counts the number of “hard words” in a sample of text.  Then, I did the same for a sample of reviews of each of the books (about 20 reviews for each book) and averaged them.  The graph below shows how the complexity of the book summary is related to the complexity of the reviews for that book.

A simple bivariate regression (one dependent variable and one independent variable), run in Excel, relating the complexity of the book summary (called a blurb on the graph axis) to the complexity of the reviews for that book shows that they are related in an approximately linear fashion (the straight line).  The graph shows the regression relation in equation form and the strength of the association in “R-square” form.  R-square ranges from 0 to 1: a value close to 1 indicates a very good linear relationship between the variables, and a value close to 0 indicates no linear relationship.  (It is the correlation coefficient, r, that ranges from -1 to 1, with the sign giving the direction of the relationship.)  In this case, the R-square is nearly 0.5, which indicates a reasonably strong relationship.

What does it tell you?  Briefly, I interpret it to mean that people read books that are written at the level with which they are comfortable, and that also corresponds to the level at which they generally write (reviews in this case).  More precisely, they read books that have summaries written at that level, but we can reasonably assume that  the summary is probably written at much the same level of complexity as the book.  The book’s summary signals to potential readers how difficult the book is likely to be, in a vocabulary sense.  Then, people respond (perhaps unconsciously) to that cue, and pick a book that corresponds to their vocabulary, and eventually review it with a similarly complex vocabulary.

This is a very simple model, since it only includes two variables and assumes a linear relationship.  Much more complicated models are possible, which could include dozens of variables (assuming a sufficiently rich dataset) as well as non-linear terms, interaction effects and other complexities.  But this simple case gets the idea across.
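For readers who would rather see the arithmetic than the Excel chart, here is a minimal sketch of a two-variable least-squares fit and its R-square.  The scores in it are made-up placeholders, not the Amazon data discussed here, and the function names (`fit_line`, `r_squared`) are simply my own labels for illustration.

```python
# Hypothetical (made-up) "hard word" scores for blurbs and for reviews.
# These are NOT the Amazon figures from the post, just placeholders.
blurb_scores = [8.0, 10.0, 12.0, 9.0, 14.0, 11.0, 13.0, 7.0]
review_scores = [6.5, 7.0, 9.5, 8.0, 9.0, 7.5, 10.0, 6.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line (0 to 1)."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_resid / ss_total


slope, intercept = fit_line(blurb_scores, review_scores)
r2 = r_squared(blurb_scores, review_scores, slope, intercept)
print(f"review = {intercept:.2f} + {slope:.2f} * blurb,  R-square = {r2:.2f}")
```

Adding more independent variables turns this into multiple regression, at which point a statistics package is a better tool than hand-rolled sums.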

Chapter 4 – Who is Most Likely to Buy and How do I Target?

·         This continues the focus on methods that predict a dependent variable from one or more independent variables.  In this case, the dependent variable is of the binary or yes/no variety, so the method under review is known as logistic regression.  Here, the equation predicts the odds (and hence the probability) that a particular outcome will occur, namely that a person will buy the product or service in question.  Note that other outcomes could be of interest, such as clicking a link or signing up on a mailing list.  The author goes into some of the nuances of this method, such as how to interpret logistic regression coefficients, the use of the prediction vs. outcome matrix and lift charts (used to determine which deciles of the population in question are the best prospects).  He also includes a useful sidebar on multicollinearity, which arises when two or more independent variables are highly correlated with each other.

·         A common use of logistic regression is related to attrition or retention.  Below is an exploratory example, using data that I have collected on the Top 100 Amazon ebooks for 2013 and 2014.  I took the book to be “retained” if it was still in the top 3200 rank by mid-2015.  The logistic regression then tests for variables that are significant in terms of predicting which books are in the retained group.  In this case, I just tested the effect of genre.  Note that this is a rather artificial example - normally a binary variable like this would indicate something like “did or didn’t drop out of school”, “did or didn’t die”, or some similarly stark yes/no result.  But we can imagine a case where a cut-off point in ranking could have that effect - for example a writer who fell below a given ranking might not have their next book accepted for publication.

The output below is from PSPP, an open source knock-off of SPSS.

                      B   S.E.   Wald  df  Sig.  Exp(B)
Step 1  DG_LITFIC    .04   .49    .01   1   .94    1.04
        DG_OTHER    -.44   .62    .52   1   .47     .64
        DG_ROMANCE -1.31   .35  13.84   1   .00     .27
        DG_SFF       .70   .70    .99   1   .32    2.02
        Constant     .60   .27   5.07   1   .02    1.82

The output shows that the only “statistically significant” genre effect is for Romance books, which are significantly less likely to be retained in the higher rankings than the reference genre, which was Thriller.  So, this result would imply that Romance writers have a shorter lifespan as authors, if ranking cutoffs are used to determine whether a writer continues to be published.

Through some mathematical calculations (involving exponentials, which “undo” the logarithmic transformation on which logistic regression is based) we can get a more understandable version of the result, namely the probability of a book being retained in the higher ranking category, by genre.
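As a rough sketch of that conversion, the snippet below takes the B column of the PSPP output above, adds each genre coefficient to the constant to get the log-odds, and applies the inverse logit to get a probability.  The coefficients are the rounded values as printed, so the results are approximate; the function name `retention_probability` is my own label, and treating Thriller as "constant only" follows from it being the reference genre.

```python
import math

# Coefficients (B column) as printed in the PSPP output, rounded to 2 decimals.
# Thriller is the reference genre, so it is represented by the constant alone.
constant = 0.60
genre_coefficients = {
    "Thriller (reference)": 0.00,
    "Lit Fic": 0.04,
    "Other": -0.44,
    "Romance": -1.31,
    "SFF": 0.70,
}


def retention_probability(log_odds):
    """Convert log-odds to a probability with the inverse logit."""
    return 1.0 / (1.0 + math.exp(-log_odds))


for genre, b in genre_coefficients.items():
    p = retention_probability(constant + b)
    # exp(b) is the odds ratio relative to Thriller (the Exp(B) column).
    print(f"{genre:22s} odds ratio vs Thriller = {math.exp(b):.2f}, "
          f"P(retained) = {p:.2f}")
```

With these rounded inputs, Thriller comes out at roughly 0.65 and Romance at roughly 0.33.  Only the Romance difference was statistically significant, so the other genres' figures should be read as being indistinguishable from Thriller on this data.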

Chapter 5 – When are Customers Most Likely to Buy?

·         This continues the focus on methods that predict a dependent variable from one or more independent variables.  The chapter moves on to a method that is fairly new to marketing, known as survival analysis.  Survival analysis, frequently encountered in medical research, can be used in marketing to estimate when customers are most likely to buy, rather than just the yes/no question answered with logistic regression.  Ultimately, one derives survival curves, similar to the life tables of demography.

One other advantage/complication of survival analysis over logistic regression is its ability to handle “censored data”.  That has nothing to do with risqué pictures, but rather refers to data about respondents who have gone missing (e.g. have dropped out of a study, moved to a new unknown address, etc.).  It can also refer to subjects who have not yet “converted” at the time that the analysis was done, or the study was cut off.
Note that besides tracking “time to event”, survival analysis also enables the researcher to examine covariates that could impact this measure.  This, of course, is key information.  If you discover that females purchase more quickly than males, for example, that would be very useful in deciding how to market by gender.  So, survival analysis has both descriptive and predictive aspects.
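To make the idea of censoring a bit more concrete, here is a minimal sketch of the Kaplan-Meier estimator, a standard way of building a survival curve (I am not claiming it is the exact method used in the book).  The customer records are hypothetical, and “survival” here simply means “has not yet purchased”.

```python
# Hypothetical customers: (days until purchase or end of follow-up,
# True if they purchased, False if censored, i.e. lost or not yet converted).
observations = [
    (5, True), (8, True), (8, False), (12, True), (15, False),
    (20, True), (20, True), (25, False), (30, True), (30, False),
]


def kaplan_meier(data):
    """Return [(time, share not yet purchased)] at each purchase time."""
    at_risk = len(data)
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in data}):
        events = sum(1 for time, bought in data if time == t and bought)
        leaving = sum(1 for time, _ in data if time == t)
        if events:
            # Multiply by the share of those still at risk who did not buy at t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Everyone observed at time t (buyers and censored) leaves the risk set.
        at_risk -= leaving
    return curve


for t, s in kaplan_meier(observations):
    print(f"day {t:2d}: {s:.3f} not yet purchased")
```

The censored customers never count as purchases, but they do stay in the “at risk” group until the day they drop out of view, which is how the method uses their partial information instead of throwing it away.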

·         I haven’t actually had an opportunity to use this method, so I can’t add anything specific about survival analysis.  One might note another technique that is used to determine why some records fall on one side of a binary divide or the other, which is called decision trees.  It’s more of a “data science” method than a statistical modelling method, though.  The author prefers statistical modelling methods, though many people are now using decision trees.  The decision tree method has the virtue of being quite easy for most people to understand.  However, given enough variables, it can result in too much emphasis being placed on relationships that are actually the result of random chance (though the analyst always has to be mindful of this possibility, regardless of the chosen method of analysis).

Chapter 6 – Modeling Dependent Variable Techniques (With More than One Equation)

·         This chapter goes into econometric modeling, using systems of simultaneous equations, basically supply and demand equations.  It talks about endogenous versus exogenous variables, in other words variables that are determined within the system (such as the price of the product) versus those determined outside the system (such as consumer incomes).  Even that is conceptually tricky: the seller doesn't have total control over pricing, since incomes still have a huge influence over it.  It is a complicated subject, and the author doesn't go into great detail – just enough to drive home the point that an enterprise has to look at the interaction and substitution effects that decisions on one product will have on other products, especially those that consumers consider to be close substitutes.
·         I haven’t had much to do with these methods, not being an economist.
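Still, the endogenous/exogenous distinction can be illustrated with a toy two-equation system.  The sketch below assumes a linear demand curve and a linear supply curve with entirely hypothetical coefficients (none of them come from the book), and solves for the price and quantity at which the two agree.

```python
def equilibrium(income, a=100.0, b=2.0, c=0.5, d=10.0, e=3.0):
    """Solve demand = supply for price and quantity.

    Demand:  quantity = a - b * price + c * income
    Supply:  quantity = d + e * price
    All coefficients are hypothetical placeholders.
    """
    # Setting demand equal to supply and rearranging for price.
    price = (a - d + c * income) / (b + e)
    quantity = d + e * price
    return price, quantity


# Income is exogenous (set outside the system); price and quantity
# are endogenous (they fall out of solving the two equations together).
for income in (40.0, 60.0, 80.0):
    price, quantity = equilibrium(income)
    print(f"income {income:5.1f} -> price {price:5.2f}, quantity {quantity:6.2f}")
```

Even in this tiny example, raising income shifts the equilibrium price, which is the sense in which the seller does not fully control pricing.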

In a later blog, I will go through some of the other statistical techniques that he explains, adding some of my own analytic experience, especially as it pertains to the book publishing domain.  Those methods are mostly of the classification and dimension reduction type (e.g. for market segmentation).


And, since this is a book themed blog, here is your chance to buy a book.  This is a travelogue, featuring a statistician and a truck driver, on a long haul trip, taking lumber to Texas and oilfield equipment to Alberta.  So, you get content that alludes to the theme of the blog – statisticians and markets. :).
On the Road with Bronco Billy - A Trucking Journal
Kindle Edition
What follows is an account of a ten-day journey through western North America during a working trip, delivering lumber from Edmonton, Alberta to Dallas, Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture was intriguing. In early May 1997 they hit the road.

Some time has passed since this journal was written and many things have changed since the late 1990s. That makes the journey not just a geographical one, but also a historical account, which I think only increases its interest.

We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.

The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.
