Part 2 of a Review of “Marketing Analytics – A Practical Guide to Real Marketing Science” (by Mike Grigsby Kogan)
http://www.amazon.com/Marketing-Analytics-Practical-Guide-Science/dp/074947417
A while back, I got a book from my Skillsoft learning library, with the above title. As a statistician/analyst at a university, I was curious about how the statistical techniques that I use on a routine basis are applied in the marketing world. And as someone who is involved in a small publishing venture, I was also curious about the theory and practice of marketing in general, and how it might be used to sell more novels :). So, I thought I would read the book and do a write-up for the blog, to help fix ideas in my own mind and inform blog readers as well.
Naturally, if the book interests you, you should go to the source. The Amazon link is given above. The book sells for about 20 bucks, in both e-book and paperback form. Though the content gets somewhat technical, given the subject matter, the writer maintains a very readable style in my opinion.
Since the book is fairly long, a proper look at it will take at least two blogs, maybe three. I previously did a blog on Part One of the book, which was concerned with some elementary statistical ideas, as well as some fundamental concepts and strategies within the marketing world. What follows is my synopsis of Part Two of the book, in point form. This section deals with some fairly advanced statistical techniques, in the “predictive analytics” realm, namely:
· Multiple regression,
· Logistic regression,
· Survival analysis,
· And econometric modelling.
In some cases, I have inserted an example of a given technique, from my own book-related research, in italics.
Part Two – Dependent Variable Techniques
Chapter 3 – Modeling Dependent Variable Techniques - the things that drive demand
· This chapter focuses on techniques that use equations to predict a dependent variable, based on the values of one or more independent variables. These are generally known as regression techniques or general linear models. The author gives a decent explanation of this, even mentioning a few subtleties, such as the use of dummy variables and price elasticity (there was a substantial technical section on this subject). But this is a very brief review of a very large subject, so it is difficult to say whether a newcomer to the concepts would be able to really understand it well.
· Here’s a quick example of a simple bivariate (two-variable) regression relationship, from some book-publishing data that I dug out of the Amazon website. For a selection of 18 books on the “Alsobot” of one of our books (Kati of Terra Book 1), I analyzed the book summaries with some text analysis software that counts the number of “hard words” in a sample of text. Then, I did the same for a sample of reviews of each of the books (about 20 reviews for each book) and averaged them. The graph below shows how the complexity of the book summary is related to the complexity of the reviews for that book.
A simple bivariate (one dependent variable and one independent variable) regression (using Excel) of the complexity of the book summary (called a blurb on the graph axis) and the complexity of the reviews for that book shows that they are related, in an approximately linear fashion (the straight line). The graph shows the regression relation in equation form and the strength of the association in “R-square” form. An R-square close to 1 indicates a very strong linear relationship between the variables, and an R-square close to 0 indicates no linear relationship. R-square cannot be negative; the direction of the relationship (positive or negative) is given by the sign of the slope, or of the correlation coefficient r, which does range from -1 to 1. In this case, the R-square is nearly 0.5, which indicates a reasonably strong relationship.
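For readers who would rather see the arithmetic than trust a spreadsheet, here is a minimal sketch of the same kind of two-variable regression in plain Python. The numbers are made up for illustration (they are not the data behind the graph), and the function and variable names are my own. It fits the line by least squares and reports both R-square and the signed correlation r.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, plus R-square and r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient, between -1 and 1
    return slope, intercept, r * r, r

# Hypothetical "hard word" percentages, NOT the data behind the graph.
blurb_complexity = [8.0, 10.5, 12.0, 9.0, 14.0, 11.0, 13.5, 7.5]
review_complexity = [9.5, 10.0, 12.5, 11.0, 12.0, 10.5, 13.0, 9.0]

slope, intercept, r_squared, r = simple_regression(blurb_complexity, review_complexity)
print(f"review = {intercept:.2f} + {slope:.2f} * blurb, R-square = {r_squared:.2f}")

# Flipping the sign of y reverses the direction of the relationship:
# r changes sign, but R-square stays the same.
flipped = [-v for v in review_complexity]
_, _, r_squared_neg, r_neg = simple_regression(blurb_complexity, flipped)
print(f"r = {r:.2f} vs {r_neg:.2f}, R-square = {r_squared:.2f} vs {r_squared_neg:.2f}")
```

The second call is only there to show that R-square carries the strength of the relationship while the sign of r (or of the slope) carries its direction.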
What does it tell you? Briefly, I interpret it to mean that people read books that are written at the level with which they are comfortable, and that also corresponds to the level at which they generally write (reviews in this case). More precisely, they read books that have summaries written at that level, but we can reasonably assume that the summary is probably written at much the same level of complexity as the book. The book’s summary signals to potential readers how difficult the book is likely to be, in a vocabulary sense. Then, people respond (perhaps unconsciously) to that cue, and pick a book that corresponds to their vocabulary, and eventually review it with a similarly complex vocabulary.
This is a very simple model, since it only includes two variables and assumes a linear relationship. Much more complicated models are possible, which could include dozens of variables (assuming a sufficiently rich dataset) as well as non-linear terms, interaction effects and other complexities. But this simple case gets the idea across.
Chapter 4 – Who is Most Likely to Buy and How do I Target?
· This continues the focus on methods that predict a dependent variable from one or more independent variables. In this case, the dependent variable is of the binary or yes/no variety, so the method under review is known as logistic regression. Here, the equation predicts the odds (and from them the probability) that a particular outcome will occur, namely that a person will buy the product or service in question. Note that other outcomes could be of interest, such as clicking a link or signing up for a mailing list. The author goes into some of the nuances of this method, such as how to interpret logistic regression coefficients, the use of the prediction vs. outcome matrix and lift charts (used to determine which deciles of the population in question are the best prospects). He also includes a useful sidebar on multicollinearity, which arises when two independent variables are highly correlated with each other.
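To make the lift chart idea a bit more concrete, here is a small sketch of how lift by decile might be computed. It is my own illustration, not the author's procedure: the scores and outcomes are invented, the function name is hypothetical, and it assumes at least ten customers so that every decile has someone in it.

```python
def decile_lift(scores, outcomes):
    """Rank customers by predicted score (best first), split them into ten
    equal-sized groups, and return each group's response rate divided by
    the overall response rate. Assumes len(scores) >= 10."""
    ranked = [o for _, o in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    overall_rate = sum(ranked) / n
    lifts = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        lifts.append((sum(group) / len(group)) / overall_rate)
    return lifts

# Hypothetical model scores for 20 customers, and whether each one bought (1) or not (0).
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

lifts = decile_lift(scores, outcomes)
for decile, lift in enumerate(lifts, start=1):
    print(f"decile {decile:2d}: lift = {lift:.2f}")
```

A lift above 1 in the top deciles means the model is concentrating the buyers among the customers it scores highest, which is what you want before spending money on targeting them.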
· A common use of logistic regression is related to attrition or retention. Below is an exploratory example, using data that I have collected on the Top 100 Amazon ebooks for 2013 and 2014. I took the book to be “retained” if it was still in the top 3200 rank by mid-2015. The logistic regression then tests for variables that are significant in terms of predicting which books are in the retained group. In this case, I just tested the effect of genre. Note that this is a rather artificial example - normally a binary variable like this would indicate something like “did or didn’t drop out of school”, “did or didn’t die”, or some similarly stark yes/no result. But we can imagine a case where a cut-off point in ranking could have that effect - for example a writer who fell below a given ranking might not have their next book accepted for publication.
The output below is from PSPP, an open source knock-off of SPSS. The output shows that the only “statistically significant” genre effect is for Romance books, which are significantly less likely to be retained in the higher rankings than the reference genre, which was Thriller. So, this result would imply that Romance writers have a shorter lifespan as authors, if ranking cutoffs are used to determine whether a writer continues to be published. The coefficients (B) are on the log-odds scale, since logistic regression is a form of regression analysis based on a logarithmic transformation of the odds. Taking exponentials “undoes” that transformation and gives a more understandable version of the result, namely that the probability of a book being retained in the higher ranking category, by genre, is:
Thriller | 65%
Lit Fic | 66%
Other | 54%
Romance | 33%
SFF | 79%
╔══════╦══════════╤═════╤════╤═════╤══╤════╤══════╗
║      ║          │    B│S.E.│ Wald│df│Sig.│Exp(B)║
╠══════╬══════════╪═════╪════╪═════╪══╪════╪══════╣
║Step 1║DG_LITFIC │  .04│ .49│  .01│ 1│ .94│  1.04║
║      ║DG_OTHER  │ -.44│ .62│  .52│ 1│ .47│   .64║
║      ║DG_ROMANCE│-1.31│ .35│13.84│ 1│ .00│   .27║
║      ║DG_SFF    │  .70│ .70│  .99│ 1│ .32│  2.02║
║      ║Constant  │  .60│ .27│ 5.07│ 1│ .02│  1.82║
╚══════╩══════════╧═════╧════╧═════╧══╧════╧══════╝
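The step from the B column of the PSPP output to the genre probabilities can be sketched in a few lines of Python. The coefficients are copied from the output; the function and variable names are my own. Thriller is the reference genre, so its log-odds are just the constant, and every other genre adds its own coefficient to that. Because the output is rounded to two decimals, the results agree with the percentages quoted in this post only to within about one percentage point.

```python
import math

# Coefficients (column B) copied from the PSPP output. Thriller is the
# reference genre, so it contributes nothing beyond the constant.
constant = 0.60
genre_coefficients = {
    "Thriller": 0.0,
    "Lit Fic": 0.04,
    "Other": -0.44,
    "Romance": -1.31,
    "SFF": 0.70,
}

def retention_probability(coefficient, intercept=constant):
    """Convert log-odds to odds with exp(), then odds to a probability."""
    odds = math.exp(intercept + coefficient)
    return odds / (1 + odds)

probabilities = {genre: retention_probability(b)
                 for genre, b in genre_coefficients.items()}

for genre, p in probabilities.items():
    print(f"{genre:<8} {p:.1%}")
```

Note that Exp(B) in the output is an odds ratio relative to Thriller, not a probability, which is why the constant has to be added back in before converting.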
Chapter 5 – When are Customers Most Likely to Buy?
· This continues the focus on methods that predict a dependent variable from one or more independent variables. This chapter moves on to a method that is fairly new to marketing, known as survival analysis. Survival analysis, frequently encountered in medical research, can be used in marketing to estimate when customers are most likely to buy, rather than just the yes/no question answered with logistic regression. Ultimately, one derives survival curves, similar to the life tables of demography.
One other advantage/complication of survival analysis over logistic regression is its ability to handle “censored data”. That has nothing to do with risqué pictures, but rather refers to data about respondents who have gone missing (e.g. have dropped out of a study, moved to a new unknown address, etc.). It can also refer to information on subjects who have not yet “converted” at the time that the analysis was done, or the study was cut off.
Note that besides tracking “time to event”, survival analysis also enables the researcher to examine covariates that could impact this measure. This, of course, is key information. If you discover that females purchase more quickly than males, for example, that would be very useful in deciding how one might market by gender. So, survival analysis has both descriptive and predictive aspects.
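To show how censoring enters the calculation, here is a minimal sketch of a Kaplan-Meier survival curve, which is one standard way of estimating such curves (I am not claiming it is the specific method the author uses). The data are invented: days from sign-up to first purchase, with a flag saying whether the purchase was actually observed or the customer was censored. In this setting "survival" just means "has not bought yet".

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time an event occurred.
    durations: time to event or to censoring; observed: True if the event
    (here, a purchase) was seen, False if the record was censored."""
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)  # events plus censored records at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of those still at risk who did not convert at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone with duration t leaves the risk set, censored or not.
        at_risk -= leaving[t]
    return curve

# Hypothetical data: days until first purchase; False = no purchase yet at cut-off.
durations = [2, 3, 3, 5, 8, 8, 10, 10]
observed = [True, True, False, True, True, False, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"day {t:2d}: {s:.3f} of customers have not yet purchased")
```

The censored customers still count in the denominator up to the day they drop out of view, which is exactly the information a plain yes/no logistic regression would throw away.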
· I haven’t actually had an opportunity to use this method, so I can’t add anything specific to survival analysis. One might note another technique that is used to determine why some records fall on one side of a binary divide or the other, which is called decision trees. It’s more of a “data science” method than a statistical modelling method, though. The author prefers the latter methods, though many people are now using decision trees. The method has the virtue of being quite easy for most people to understand. However, it can result in too much emphasis being placed on relationships that are actually the result of random chance, given enough variables (though the analyst always has to be mindful of this possibility, regardless of the chosen method of analysis).
Chapter 6 – Modeling Dependent Variable Techniques (With More than One Equation)
· This chapter goes into econometric modeling, using systems of simultaneous equations, basically supply and demand equations. It talks about endogenous versus exogenous variables, in other words variables that are determined within the system (such as the price of the product) versus those determined outside the system (such as consumer incomes). Even that is conceptually tricky. The seller doesn't have total control over pricing, since incomes still have a huge influence over pricing. It is a complicated subject, and the author doesn't go into great detail – just enough to drive home the point that an enterprise has to look at the interaction and substitution effects that decisions on one product will have on other products, especially those that consumers consider to be close substitutes.
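A toy version of a supply and demand system may help with the endogenous/exogenous distinction. This is my own sketch with made-up linear equations and coefficients, not anything taken from the book. Price and quantity are endogenous (they come out of solving the two equations together), while income is exogenous (it is fed in from outside).

```python
def equilibrium(income, a=100.0, b=2.0, c=0.5, d=10.0, e=3.0):
    """Solve the two-equation system
        demand: quantity = a - b * price + c * income
        supply: quantity = d + e * price
    for the endogenous variables (price, quantity), given exogenous income.
    All coefficients are hypothetical."""
    # Set demand equal to supply and solve for price.
    price = (a + c * income - d) / (b + e)
    quantity = d + e * price
    return price, quantity

price_low, qty_low = equilibrium(income=40.0)
price_high, qty_high = equilibrium(income=60.0)
print(f"income 40: price = {price_low:.2f}, quantity = {qty_low:.2f}")
print(f"income 60: price = {price_high:.2f}, quantity = {qty_high:.2f}")
```

Even in this tiny example, raising income shifts the equilibrium price without the seller changing anything, which is the sense in which the seller doesn't have total control over pricing.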
· I haven’t had much to do with these methods, not being an economist.
In a later blog, I will go through some of the other statistical techniques that he explains, adding some of my own analytic experience, especially as it pertains to the book publishing domain. Those methods are mostly of the classification and dimension reduction type (e.g. for market segmentation).
–------------------------------------------------------
And, since this is a book themed blog, here is your chance to buy a book. This is a travelogue, featuring a statistician and a truck driver, on a long haul trip, taking lumber to Texas and oilfield equipment to Alberta. So, you get content that alludes to the theme of the blog – statisticians and markets. :).
Kindle Edition
Amazon U.S. http://www.amazon.com/gp/product/B00X2IRHSK
Amazon U.K. http://www.amazon.co.uk/gp/product/B00X2IRHSK
What follows is an account of a ten day journey through western North America during a working trip, delivering lumber from Edmonton, Alberta to Dallas, Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture was intriguing. In early May 1997 they hit the road.
Some time has passed since this journal was written and many things have changed since the late 1990’s. That renders the journey as not just a geographical one, but also a historical account, which I think only increases its interest.
We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.
The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.