Friday, 17 January 2014

Amazon Top 100 Kindle Books - Relationship between Sales Rank and Number of Reviews

Here's a quick summary of the blog, since it is lengthy but fascinating :)

· Basically, we have used multiple lines of evidence to show that there is a fairly reliable relationship between the number of reviews a book received and the number of sales of that book.
· We know that if we do a scatter plot of the number of reviews a book receives versus the rank of that book, the data follows a power law, with considerable noise. That would indicate that the number of reviews a book receives is related to its sales, since we know that sales also follow a power law. The power law that we see in the number of reviews graph is a sort of visible manifestation of the underlying sales power law.
· The fact that the sales ranks and the review-count ranks are also correlated supports the idea that there is a relationship between number of sales and number of reviews.
· So, we can be reasonably confident that reviews are a decent predictor of sales. If we know the number of reviews a book received, we can estimate its sales. That won’t necessarily be accurate for any particular book, but it will be pretty good on average.
· As a rough guide, we can multiply the number of reviews that a book received by 100, to estimate that book’s sales.  If we have reason to think that the book had a lot of free downloads, we will have to adjust our estimate accordingly. 
Next time we will look a little more deeply into the patterns of reader engagement that this data may reveal.
 
Now, for the details:
Recently, Amazon released lists of its 100 top selling titles for Kindle ebooks. In some previous blogs, we looked at how Indies (self or independently published ebooks) compared to Trads (traditionally published or trade published ebooks) in terms of books sold (as measured by ranking within the top 100), reader satisfaction (as measured by Amazon reviews) and imputed revenues/earnings. In this fourth blog we will see what we can learn about the relationship between the sales rank a book achieves and the number of reviews it gets, at least in the top 100 Amazon ebooks.

(For more information regarding the findings from those earlier analyses, see the previous blogs: Amazon Top 100 Kindle Books - Indies versus Trads Part 1, Amazon Top 100 Kindle Books - Indies versus Trads Part 2, and Amazon Top 100 Kindle Books - Indies versus Trads Part 3.)

As we noted in earlier blogs, we have a gap in our knowledge of the Amazon Top 100 list - we know the sales rank of the books, but we don’t know the absolute level of sales, i.e. the actual number of books sold. We imputed this number from some other (limited) evidence, assuming that about 1% of all books sold ended up being reviewed. This was based on some informal surveys on Kindleboards, and some analysis of Joe Konrath’s sales data, which he provided on his blog. A graph of the Kindleboards paid data is shown below, along with a graph that includes both paid and free downloads. For books that we have good reason to believe had a lot of free downloads, perhaps 0.5% might be a better estimate.

KindleBoards - 2013 Informal Survey of Reviews vs Sales/Downloads

 
 
Here’s a scatterplot of Joe Konrath’s sales and revenue data from his blog, with reviews data mined from Amazon. We see that the percentage of books reviewed is broadly similar to the Kindleboards data: around 1% for books that didn’t have a lot of free downloads, and about 0.3% for books that did have substantial free downloads. We also see that review percentages tend to go up with average revenue per book, which suggests that free downloads were less likely to be reviewed than purchased books.
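To make the rule of thumb concrete, here is a minimal Python sketch of the arithmetic. The function name and parameter are our own invention for illustration, and the 1% and 0.5% rates are just the rough figures discussed above, not measured constants.

```python
def estimate_sales(review_count, review_rate_percent=1.0):
    """Estimate copies sold (or downloaded) from a review count.

    Assumes a fixed percentage of readers leave a review. The 1% default
    is the rough figure used in this post, not a measured constant.
    """
    if review_rate_percent <= 0:
        raise ValueError("review_rate_percent must be positive")
    return review_count * 100 / review_rate_percent


# A hypothetical book with 2,350 reviews and mostly paid sales (1% rate).
paid_estimate = estimate_sales(2350)

# The same review count if we suspect heavy free downloads (0.5% rate).
free_heavy_estimate = estimate_sales(2350, review_rate_percent=0.5)

print(round(paid_estimate), round(free_heavy_estimate))
```

With these assumed rates, the same 2,350 reviews imply about 235,000 copies at 1% and about 470,000 at 0.5%, which shows how sensitive the estimate is to the review rate we pick.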
 
 
 
 
Assuming that this relationship holds, we should see that the top ranked books generally had more Amazon reviews than books that ranked lower down the list.  As the graph below shows, this is true.  The higher ranked books did get more reviews, though the relationship shows a fair bit of scatter.

The solid line shows our old friend, a power law.  That was Excel’s best fit power function to the dataset.  The equation is also shown - we will use that later to estimate expected review counts for books much farther down the list than these top 100.  Note that we could derive this ourselves by rescaling the data logarithmically, estimating a best straight line, then getting the value of the exponent from the slope of that line.  But, we’re lazy, so we will go with Excel’s result.
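For the less lazy, the do-it-yourself route described above might look something like this Python sketch. It fits a straight line to the logs of the data and reads the exponent off the slope. The function names are ours, and the four sample points are rounded values from the expected-reviews table later in this post, not the raw top 100 data, so treat the output as illustrative only.

```python
import math


def fit_power_law(ranks, reviews):
    """Fit reviews = coefficient * rank ** exponent by least squares on logs."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in reviews]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx  # slope of the log-log line
    coefficient = math.exp(mean_y - exponent * mean_x)  # from the intercept
    return coefficient, exponent


def predict_reviews(rank, coefficient, exponent):
    """Expected review count at a given sales rank under the fitted law."""
    return coefficient * rank ** exponent


# Illustrative points only: rounded values from the table in this post.
sample_ranks = [10, 20, 40, 80]
sample_reviews = [2350, 1650, 1150, 810]

coefficient, exponent = fit_power_law(sample_ranks, sample_reviews)
print(round(coefficient), round(exponent, 2))
print(round(predict_reviews(160, coefficient, exponent)))
```

For these sample points the exponent comes out at roughly -0.5, meaning that reviews fall off approximately with the square root of rank. Fitting on logs weights the points differently from a fit on the raw numbers, so a result from this approach need not match another tool's fit exactly.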

There are a number of outliers - for example, one book ranked at about position 70 had nearly 4000 reviews, much higher than the general trend. Conversely, some books had fewer reviews than might be expected, such as some ranked in the teens that only had about 1000 reviews, when we would expect about twice that many. But we expect a certain amount of scatter in a real world dataset, where variables almost never co-vary perfectly. It would be suspicious if they did.
Here are some estimates of how many reviews we would expect a book to receive, given its position in the sales rankings, based on the functional form given by the solid line (and formula), and extrapolating that relationship beyond the top 100. Note that these results are for an “all books” type list, since the top 100 were not segmented by genre. If we were looking at results within a specific genre, the results would be much different (that would be an interesting analysis, though):
Amazon Rank versus Number of Expected Reviews

Rank        Expected Reviews
10          2,350
20          1,650
40          1,150
80          810
160         570
320         400
640         280
1,280       190
2,560       140
5,120       100
10,240      70
20,480      50
40,960      30
81,920      20
163,840     20
327,680     10

A word of caution is in order.  It is always risky to extrapolate very far from the ends of your actual dataset, and power law relationships often don’t play nicely in the tails of a distribution.  Nonetheless, the result is pretty much in line with what we might expect.  Books way down the list are lucky to get a dozen reviews, while those far up the list can easily get thousands.  The result has face validity.
We can look at the relationship between sales rank and reviews another way, namely by transforming the reviews data from raw numbers into ranks, then comparing those ranks with sales rank. I have done that in the graph below. I am hoping that this graph comes out reasonably well in a blog - it might be expecting a lot of Blogger, though, as it is pretty packed with detail.
The red dotted line shows what would happen if there was a perfect correlation (that just means matching) between the Sales Rank and the Number of Reviews Rank.  The blue symbols show the actual data points in the dataset.  Points that are below that red dotted line have more reviews than would be expected from their sales rank, while points above the line have fewer reviews than would be expected from their sales rank.   For example, you can see a point that is about 30th in sales, but about 90th in reviews.  It has far fewer reviews than we would expect.  That book happens to be a Trad published romance.  Conversely, there is a point at nearly 80th place in sales rank, but about 15th place in reviews rank.  It has far more reviews than we would expect.  That book happens to be an Indie romance.
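Here is a small Python sketch of the rank transformation behind a graph like this. It assumes review rank sits on the vertical axis, so a positive gap corresponds to a point above the dotted line. The function names and the five-book list are hypothetical, made up purely for illustration.

```python
def rank_descending(values):
    """Return ranks where 1 = largest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def review_rank_gaps(sales_ranks, review_counts):
    """Review rank minus sales rank for each book.

    Positive: fewer reviews than the sales rank suggests (above the diagonal).
    Negative: more reviews than the sales rank suggests (below the diagonal).
    """
    review_ranks = rank_descending(review_counts)
    return [rr - sr for sr, rr in zip(sales_ranks, review_ranks)]


# Hypothetical five-book list, not data from the post.
sales_ranks = [1, 2, 3, 4, 5]
review_counts = [900, 4000, 1500, 200, 1200]
print(review_rank_gaps(sales_ranks, review_counts))
```

In this made-up list the top seller has a gap of +3 (it sells best but is only fourth in reviews), while the fifth-ranked book has a gap of -2, so it punches above its weight in reviews.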
We will try to delve deeper into whether there are any patterns in this “too few/too many” dimension later, probably in the next blog. We can use that to uncover patterns of reader engagement, by price, genre, and so forth. Note that engagement isn’t necessarily the same thing as satisfaction - someone can be passionate enough about a book to engage (write a review), but not necessarily be happy about the book.
First, let’s look into whether there really is a relationship here, or if the points are just random. I calculated a rank correlation between sales rank and review rank, then ran a simple hypothesis test on it. This is standard statistical stuff, and it established with very high confidence that the relationship is real and not a chance pattern - books ranked higher in sales also tend to rank higher in reviews. By the way, the rank correlation was 0.615 and the test statistic was 6.12, which is highly significant under the appropriate statistical assumptions and test statistic distribution.
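For anyone who wants to try this at home, here is a hedged Python sketch. A rank correlation is just an ordinary (Pearson) correlation computed on the ranks. The post doesn't name the exact test, so the z formula below is an assumption: it is the large-sample normal approximation, z = r * sqrt(n - 1), which happens to reproduce the reported 6.12 when r = 0.615 and n = 100 books. The function names and the five-book ranks are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; applied to two lists of ranks it gives a rank correlation."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def z_statistic(rank_correlation, n):
    """Large-sample normal approximation (assumed here, not stated in the post)."""
    return rank_correlation * math.sqrt(n - 1)


# Hypothetical sales ranks and review ranks for five books.
toy_correlation = pearson([1, 2, 3, 4, 5], [4, 1, 2, 5, 3])
print(round(toy_correlation, 3))

# Plugging in the figures reported above, assuming n = 100 books.
print(round(z_statistic(0.615, 100), 2))
```

Under the assumption of no real relationship, a z value above about 3 would already be rare, so a value around 6 leaves very little room for the "just random" explanation.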
So, what’s the story this time?
· Basically, we have used multiple lines of evidence to show that there is a fairly reliable relationship between the number of reviews a book received and the number of sales of that book.
· We know that if we do a scatter plot of the number of reviews a book receives versus the rank of that book, the data follows a power law, with considerable noise. That would indicate that the number of reviews a book receives is related to its sales, since we know that sales also follow a power law. The power law that we see in the number of reviews graph is a sort of visible manifestation of the underlying sales power law.
· The fact that the sales ranks and the review-count ranks are also correlated supports the idea that there is a relationship between number of sales and number of reviews.
· So, we can be reasonably confident that reviews are a decent predictor of sales. If we know the number of reviews a book received, we can estimate its sales. That won’t necessarily be accurate for any particular book, but it will be pretty good on average.
· As a rough guide, we can multiply the number of reviews that a book received by 100, to estimate that book’s sales.
Next time we will look a little more deeply into the patterns of reader engagement that this data may reveal.


