Friday, 4 July 2014

Book Statistics Corner, Part 5 – Amazon Reviewer Trends of Ten of the Most Popular Book Series

In a couple of previous blogs, we looked at some statistics on sales for popular book series of the recent past. As a memory refresher, those book series are repeated below:

Author and Series
J.K. Rowling - Harry Potter
Dan Brown - Robert Langdon
Stephenie Meyer - Twilight
Suzanne Collins - Hunger Games
Robert Jordan - Wheel of Time
Stephen King - The Dark Tower
G.R.R. Martin - Game of Thrones
Veronica Roth - Divergent
Douglas Adams - Hitchhiker's Guide
Patrick O'Brian - Aubrey/Maturin

As noted previously, these 10 series represent nearly 1 billion copies sold.
In the last blog, we focused primarily on how the number of Goodreads raters varied by the position of the book within the series.  In other words, we tracked the number of people who rated book 1, book 2, book 3, etc. (normalized to book 1 = 100) to see if there was a pattern to the data.  And in fact we did discover a very prominent trend, which was true for most of the series that we looked at; for the most part, the number of Goodreads raters declined from book to book, with the drop-off best modelled by our old friend, the power law.
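For anyone who wants to reproduce the normalization step, here is a minimal Python sketch.  The function name and the rater counts are made up purely for illustration; they are not the actual Goodreads figures.

```python
def normalize_to_first(counts):
    """Rescale a series so that book 1 = 100 and later books are relative to it."""
    first = counts[0]
    return [100.0 * c / first for c in counts]


# Hypothetical rater counts for a three-book series (illustrative only).
hypothetical_counts = [2000, 1500, 1000]
print(normalize_to_first(hypothetical_counts))
```

Working with normalized values makes it possible to compare the shape of the drop-off between series of very different overall popularity.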

Naturally we didn’t do this merely to analyse the behaviour of Goodreads raters, as interesting and worthy as that exercise might be.  We were, in fact, assuming that the number of Goodreads raters was a reasonably constant fraction of the number of people who had actually purchased and read the books in question.  So, we assumed that the pattern of Goodreads raters was likely to be quite similar to the pattern of purchasers, a sort of attenuated mirror image.  We examined the Harry Potter series in detail (a series for which reasonably accurate book-by-book sales are known) to verify that the number of Goodreads raters does, in fact, reflect the sales of the individual books.
The people on Goodreads are an interesting sample of avid readers, but their willingness to rate a book is a reflection of the popularity of a book over a long period.  You don’t have to have bought a book very recently to be able to add it to your Goodreads “have read” list and to rate that book.  That’s interesting data, but what about the more recent developments in the book trade?  A lot has happened in the past 5 to 10 years, particularly the rise of online book sales, both physical books and ebooks.  This includes the vast new supply of books that have been added to the world’s “book population” by small publishers and self-publishers (generally speaking we can use the term Indie for these), as well as the increased production of the big publishers, front-list and back-list.
In this blog we will look at Amazon data, to try to get a handle on the newer publishing world.  For the most part, Amazon reviewers have purchased their books fairly recently, and usually from Amazon.  As time goes on, those books are being purchased more and more in the Kindle/ebook format, which is a very different experience from buying physical books.  Ebooks are instantaneous, always available, and relatively cheap (see Dodecahedron blog “Imagine that you had a magic wineglass” for some further exploration of those ideas).  So, how have these new facts changed the pattern of book buying within the popular series listed earlier in the blog?
Let’s look at the book series in detail, focusing on the number of Amazon raters vs the position of the book within the series, and compare that with our previous results using Goodreads raters.  Again, we will go by series book sales, from largest to smallest.  In the graphs that follow, Amazon data will be in blue (lines and diamond markers), while Goodreads data will be in red (lines and square markers).  The best fit equations of these graphs are also shown, highlighted in the appropriate colors.  Also included is the R-square, which is a way of measuring how well the data actually fits the equation.  An R-square near 1.00 implies an excellent fit, while an R-square near 0 implies a very poor fit.  Scores in between those extremes are less clear-cut.
In the graphs, the Amazon data are modelled by quadratic functions.  Though they are very imperfect fits, the quadratic model seemed to capture one very important feature of the Amazon rater data; in many cases, the earlier and later books in the series got the most reviews, while the mid-point books were less likely to be reviewed.  A quadratic function incorporates that well, since the nature of a quadratic is to have one turning point (a maximum or minimum).  The power-law and straight line function fit R-squares are also shown, to help indicate which functional form best fits the observed data.
As before, the Goodreads data are modelled by power law functions.  You can refer to the earlier blogs on power functions to refresh your memory on those.  The main feature of a power law that is important here, however, is that the series decays, with the steepest drop coming early in the series and the declines becoming more gradual after that.  (A strictly constant fraction from one book to the next would be exponential decay, which is a different functional form.)
Note that in both cases, these are standard Excel options for modelling data. 
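For readers who would like to try this kind of comparison outside of Excel, here is a minimal Python sketch using NumPy.  It is only an illustration under stated assumptions, not the procedure used to produce the graphs: the function names and the review counts are hypothetical, the power law is fitted by a straight-line fit on logarithms (so all counts must be positive), and R-square is computed on the original scale for all three forms.  A spreadsheet trendline may compute the power-law R-square on the logarithms instead, so its figure could differ from this one.

```python
import numpy as np


def r_squared(y, y_fit):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_r_squares(counts):
    """Fit three simple forms to counts by series position; return each R-square."""
    y = np.asarray(counts, dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)  # book 1, book 2, ...

    # Straight line and quadratic: ordinary least-squares polynomial fits.
    line_fit = np.polyval(np.polyfit(x, y, 1), x)
    quad_fit = np.polyval(np.polyfit(x, y, 2), x)

    # Power law y = a * x**b, fitted as a straight line in log-log space
    # (an assumption of this sketch; requires positive counts).
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    power_fit = np.exp(log_a) * x ** b

    return {
        "power_law": r_squared(y, power_fit),
        "quadratic": r_squared(y, quad_fit),
        "straight_line": r_squared(y, line_fit),
    }


# Hypothetical review counts that sag in the middle of a six-book series.
hypothetical_reviews = [100, 60, 45, 50, 70, 95]
for name, value in fit_r_squares(hypothetical_reviews).items():
    print(f"{name} R-square = {value:.3f}")
```

With made-up data shaped like this (high at both ends, low in the middle), the quadratic comes out well ahead of the other two forms, which is the pattern described for several of the series below.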
1 – Harry Potter (J.K. Rowling)
The Amazon data wasn’t particularly well fit by a quadratic, though we do seem to see a general trend where the number of reviews sags in the middle of the series.  As noted previously, the Goodreads data  followed a power law quite closely.
Testing the Amazon data for three different functional forms (power law, quadratic and straight line), it turns out that the R-square is marginally better for the quadratic than the others.
 Power law R-square =
Quadratic R-square  =
Straight line R-square =
2 – Robert Langdon (Dan Brown)
In this case, we see that the quadratic function fit the Amazon data quite well, though that was probably mainly due to the influence of the last data point, which refers to the most recent book of the series.   Evidently that book was much more “popular” on Amazon than on Goodreads, at least in as much as people were inclined to do reviews.
Again, when testing the three functional forms for the Amazon data, we find the quadratic has the best fit, somewhat better than a straight-line fit (though that one wasn’t bad, either).
Power law R-square  =
Quadratic R-square  =
Straight line R-square =

3 – Twilight (Stephenie Meyer)
In this case, the quadratic function was an excellent fit to the Amazon data, with the first and last books getting almost the same level of reviews, both far higher than the middle two books.  As noted previously, the Goodreads data followed a power law very closely.  So this seems to be a textbook case, contrasting the situations in the Amazon world versus the Goodreads world. 
For the Amazon data, the quadratic form is far superior to the others:
Power law R-square =
Quadratic R-square =
Straight line R-square =

4 – Hunger Games (Suzanne Collins)
As with the Twilight series, the Hunger Games series demonstrates the Amazon versus Goodreads responses very well.  However, with only three data points we have to be careful not to over-interpret our results.  It is trivially true that 3 points can be made to fit a quadratic perfectly (if they happen to lie on a straight line, the squared term simply drops out), much as 2 points can be made to fit a straight line perfectly.  Nonetheless, it is notable that the Amazon data fits the general picture that we have seen in the other cases, with the first and last books drawing more interest than the second.
For the Amazon data, the quadratic form is far superior to the others, though the “perfect fit” to three points is no surprise, as noted above:

Power law R-square =
Quadratic R-square =
Straight line R-square =
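
The caution about three data points can be checked directly.  The short Python sketch below uses hypothetical review counts (not the actual Hunger Games or Divergent figures) and a helper function of our own naming; it shows that a quadratic reproduces any three points at distinct positions essentially exactly, so a “perfect” R-square in that situation says nothing about how good the model is.

```python
import numpy as np


def polynomial_r_squared(x, y, degree):
    """R-square of a least-squares polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_fit = np.polyval(np.polyfit(x, y, degree), x)
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical normalized review counts for a three-book series.
positions = [1, 2, 3]
reviews = [100, 40, 90]

print("Quadratic R-square     =", round(polynomial_r_squared(positions, reviews, 2), 6))
print("Straight line R-square =", round(polynomial_r_squared(positions, reviews, 1), 6))
```

The quadratic has three free parameters, so with three points there is nothing left over for the fit to get wrong.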

5 – Wheel of Time (Robert Jordan)
As with the Harry Potter series, the Amazon data for this long series was not particularly well modelled by the quadratic function.  The first and last books of the series were high points in terms of reviews, but some of the middle books also did very well in that regard.  Curiously, those were not books that were notable in the Goodreads data, which was modelled by a power law fairly well.
Nonetheless, for the Amazon data, the quadratic form is superior to the others.  Basically, though, this series was not well represented by any simple functional form.

Power law R-square =
Quadratic R-square =
Straight line R-square =

6 – The Dark Tower (Stephen King)
As with Harry Potter and Wheel of Time, the Amazon data for this series is not particularly well modelled by a quadratic, but it does follow the general trend of the first and last books being reviewed more often than the middle books.   Again, however, one of the middle books was an “outlier”. On the other hand, the Goodreads data was an excellent fit to a power law. 

Once more, though, for the Amazon data, the quadratic form is superior to the others.

Power law R-square =
Quadratic R-square =
Straight line R-square =

7 – Game of Thrones (G.R.R. Martin)
This series seems to follow the same general trend as the Twilight and Hunger Games series, which is to say that the first and last books had much higher numbers of reviews, relative to the middle books.  So, the fit to the quadratic form is very good (though there are only 5 points).  As for the Goodreads data, it is very well fit by the power law.
Once more, for the Amazon data, the quadratic form is far superior to the others.

Power law R-square =
Quadratic R-square =
Straight line R-square =

8 – Divergent (Veronica Roth)
This series follows a similar pattern to Twilight, Hunger Games and Game of Thrones.  In all of those cases, the first and last books of the series drew more Amazon interest than the middle book(s).  However, as with Hunger Games, we must note that there were only three books in the series, so a quadratic will naturally be a perfect fit.  As for the Goodreads data, as noted earlier, it had a very good fit to a power law.
As noted below, for the Amazon data, the  quadratic fit is superior (trivially so, with three data points).
Power law R-square  =
Quadratic R-square =
Straight line R-square =

9 – Hitchhiker’s Guide (Douglas Adams)
For the Amazon data, the Hitchhikers series is a good fit to a quadratic form.  However, that’s mainly due to the influence of the first point.  It actually appears to conform nearly as closely to a power law fit as the Goodreads data did.

The fits of the various functional forms to the Amazon data make that explicit, below.

Power law R-square =
Quadratic R-square =
Straight line R-square =

10 – Aubrey/Maturin (Patrick O’Brian)
The Aubrey/Maturin series conforms somewhat to the quadratic form in the Amazon data – the first and last books drew the greatest amount of interest.  But, as with Hitchhikers, the Amazon data actually conformed very well to a power law, as did the Goodreads data. 

Once more, comparing the R-squares of the various functional fits brings that out, as shown below.

Power law R-square =
Quadratic R-square =
Straight line R-square =


Some Conclusions
·         It appears that the pattern in the number of Amazon reviewers per book is quite different from the trend in the number of Goodreads raters per book.
·         Amazon reviewers seem to be inclined to review the first and last books of a series more than the middle books, resulting in a quadratic fit to the data.  As noted earlier, Goodreads raters tend to drop off continuously as the series proceeds, resulting in a power law.
·         The Amazon quadratic function phenomenon is much more evident in more recent book series, namely:
o   Robert Langdon (Dan Brown)
o   Twilight (Stephenie Meyer)
o   Hunger Games (Suzanne Collins)
o   Game of Thrones (G.R.R. Martin)
o   Divergent (Veronica Roth)
·         In some of the older series, the Amazon and Goodreads trends in reviews/raters were quite similar (best modelled by a power law), namely:
o   Hitchhiker’s Guide (Douglas Adams)
o   Aubrey/Maturin (Patrick O’Brian)
·         The other three series were less clear-cut, but the Amazon data still tended to be modelled somewhat better by the quadratic:
o   Harry Potter (J.K. Rowling)
o   Wheel of Time (Robert Jordan)
o   Dark Tower (Stephen King)
·         We can’t be sure whether the tendency for the Amazon reviewers to be more focussed on “first and last” is a reflection of underlying purchasing numbers or a reviewing preference, though it’s probably a bit of both.
·         Some people may be willing to skip some of the middle books in a series.  They may get hooked on the first book, not have time to read some of the middle books (and thus skip them), but want to find out how the story arc went by purchasing and reading the final book.
·         On the other hand, people are more likely to want to weigh in with their opinions at the outset of a series or at the conclusion of the series than they are in the middle of the series.  There is a common human reaction to want to jump on the bandwagon at the start and let the world know about it.  People also want to make their “summing up” judgements known.  So this could account for the prominence of first and last book reviews, even if the middle books were purchased and read.
The one thing that does seem pretty clear is that the Amazon ebook world has produced quite a different reviewing (and presumably purchasing) pattern than the old world of physical books and bookstores.   In the old world, scarcity was the rule - if you didn’t jump into a series at the start, you might never find the early books of the series (short of haunting used bookstores).  Now, if a series interests you, you can jump in at any time and read the whole series.
In our own small way, we have seen this at Dodecahedron Books in the buying patterns for the Kati of Terra series.  When Kati 2 came out, it sparked as many sales of Kati 1 over the following year as it did of Kati 2.  Kati 3 seems to be having a somewhat similar effect.  In this case, at least, it seems that people were seeing Kati 2 and saying “that looks interesting, but I might as well start with the first book of the series”.  Since ebooks are always available (no windowing as with physical bookstores) this is a perfectly logical response.  It will be interesting to see how these patterns evolve over time.
