· Basically, we have used multiple lines of evidence to show that there is a fairly reliable relationship between the number of reviews a book received and the number of sales of that book.
· We know that if we do a scatter plot of the number of reviews a book receives versus the rank of that book, the data follows a power law, with some considerable noise. That would indicate that the number of reviews a book receives is related to its sales, since we know that sales also follow a power law. The power law that we see in the number of reviews graph is a sort of visible manifestation of the underlying sales power law.
· The fact that the sales ranks and the number of reviews ranks are also correlated confirms that there is a relationship between number of sales and number of reviews.
· So, we can be reasonably confident that reviews are a decent predictor of sales. If we know the number of reviews a book received, we can estimate its sales. That won’t necessarily be accurate for any particular book, but it will be pretty good on average.
· As a rough guide, we can multiply the number of reviews that a book received by 100, to estimate that book’s sales. If we have reason to think that the book had a lot of free downloads, we will have to adjust our estimate accordingly.
Next time we will look a little more deeply into the patterns of reader engagement that this data may reveal.
Now, for the details:
Recently, Amazon released lists of its 100 top selling
titles for Kindle ebooks. In some previous
blogs, we looked at how Indies (self or independently published ebooks) compared
to Trads (traditionally published or trade published ebooks) in terms of books
sold (as measured by ranking within the top 100), reader satisfaction (as
measured by Amazon reviews) and imputed revenues/earnings. In this fourth blog we will see what we can
learn about the relationship between the sales rank a book gets and the number of
reviews it gets, at least in the top 100 Amazon ebooks.
(For more information regarding the findings from those
earlier analyses, see the previous blogs:
Amazon Top 100 Kindle Books - Indies versus Trads Part 1,
Amazon Top 100 Kindle Books - Indies versus Trads Part 2, and
Amazon Top 100 Kindle Books - Indies versus Trads Part 3.)
As we noted in earlier blogs, we have a gap in our knowledge
of the Amazon Top 100 list - we know the sales rank of the books, but we don’t
actually know the absolute level of sales, i.e. the actual number of books
sold. We imputed this number from some
other (limited) evidence, assuming that about 1% of all books sold ended up
being reviewed. This was based on some
informal surveys on Kindleboards, and some analysis of Joe Konrath’s sales
data, which he provided on his blog. A
graph of the Kindleboards paid data is shown below, along with a graph that
includes both paid and free downloads.
For books that we have good reason to believe had a lot of free
downloads, perhaps 0.5% might be a better estimate.
KindleBoards - 2013 Informal Survey of Reviews vs Sales/Downloads
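The review-rate assumption above can be turned into a quick calculation. This is only a minimal sketch: the function name `estimate_sales` and its `review_rate` parameter are made up for illustration, and the 1% and 0.5% rates are the rough estimates quoted above, not measured constants.

```python
def estimate_sales(review_count, review_rate=0.01):
    """Estimate copies sold (or downloaded) from a review count.

    Assumes a fixed share of buyers leave a review: about 1% for
    mostly paid sales, perhaps 0.5% when free downloads were heavy.
    """
    if review_rate <= 0:
        raise ValueError("review_rate must be positive")
    return round(review_count / review_rate)


# A book with 2,350 reviews, mostly paid sales (about 1% reviewed)
print(estimate_sales(2350))

# The same book if we believe it had a lot of free downloads (about 0.5%)
print(estimate_sales(2350, review_rate=0.005))
```

Dividing by 1% is the same as the "multiply by 100" rule of thumb. As with that rule, the result is an average-case guess and may be well off for any single book.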
Here’s a scatterplot of Joe Konrath’s sales and revenue data
from his blog, with reviews data mined from Amazon. We see that the review percentage
(reviews as a share of copies sold or downloaded) is broadly similar to the
Kindleboards data: around 1% for books that didn’t have a lot of free downloads,
and about 0.3% for books that did have substantial free downloads. We
also see that review percentages tend to go up with average revenues per book,
indicating that free downloads were not as likely to be reviewed as purchased
books.
Assuming that this relationship holds, we should see that
the top ranked books generally had more Amazon reviews than books that ranked
lower down the list. As the graph below
shows, this is true. The higher ranked
books did get more reviews, though the relationship shows a fair bit of
scatter.
The solid line shows our old friend, a power law. That is Excel’s best-fit power function to
the dataset. The equation is also shown on the graph
- we will use that later to estimate expected review counts for books much
farther down the list than these top 100.
Note that we could derive this ourselves by rescaling the data
logarithmically, estimating a best straight line, then getting the value of the
exponent from the slope of that line.
But, we’re lazy, so we will go with Excel’s result.
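For anyone less lazy, the do-it-yourself route might look something like the sketch below. It is not the Excel fit itself: the chart's equation isn't reproduced in the text, so as stand-in data the sketch uses a few of the rounded (rank, expected reviews) pairs from the table later in this post. The function name `fit_power_law` is made up for illustration.

```python
import math

# (sales rank, expected reviews) pairs taken from the table later in this
# post. They are rounded values, so the fit below is only approximate.
points = [(10, 2350), (20, 1650), (40, 1150), (80, 810), (160, 570), (320, 400)]


def fit_power_law(pairs):
    """Fit reviews = a * rank**b by least squares on log-log axes."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(reviews) for _, reviews in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the straight line = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept, converted back from logs
    return a, b


a, b = fit_power_law(points)
print(f"reviews ~ {a:.0f} * rank^{b:.2f}")
```

On these rounded values the exponent comes out at roughly -0.5, meaning expected reviews fall by about 30% each time the rank doubles.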
There are a number of outliers - for example, one book
ranked at about position 70 had nearly 4000 reviews, much higher than the
general trend. Conversely, some books
had fewer reviews than might be expected, such as some ranked in the teens
that had only about 1000 reviews, when we would expect about twice that
many. But we expect a certain amount of
scatter in a real world dataset, where variables almost never co-vary
perfectly. It would be suspicious if
they did.
Here are some estimates of how many reviews we would expect
a book to receive, given its position in the sales rankings, based on the
functional form given by the solid line (and formula), then extrapolating that
relationship. Note that these results
are for an “all books” type list, since the top 100 were not segmented by
genre. If we were looking at results
within a specific genre, the results would be much different (that would be an
interesting analysis, though):
Amazon Rank versus Number of Expected Reviews
Rank | Expected Reviews
10 | 2,350
20 | 1,650
40 | 1,150
80 | 810
160 | 570
320 | 400
640 | 280
1,280 | 190
2,560 | 140
5,120 | 100
10,240 | 70
20,480 | 50
40,960 | 30
81,920 | 20
163,840 | 20
327,680 | 10
We can look at the relationship between sales rank and
reviews another way, namely by transforming the reviews data from raw numbers into
ranks, then comparing those ranks with sales rank. I have done that in the graph below. I am hoping that this graph comes out
reasonably well in a blog - it might be expecting a lot of Blogger, though, as
it is pretty packed with detail.
The red dotted line shows what would happen if there was a
perfect correlation (that just means matching) between the Sales Rank and the
Number of Reviews Rank. The blue symbols
show the actual data points in the dataset.
Points that are below that red dotted line have more reviews than would
be expected from their sales rank, while points above the line have fewer
reviews than would be expected from their sales rank. For example, you can see a point that is
about 30th in sales, but about 90th in reviews.
It has far fewer reviews than we would expect. That book happens to be a Trad published
romance. Conversely, there is a point at
nearly 80th place in sales rank, but about 15th place in reviews rank. It has far more reviews than we would
expect. That book happens to be an Indie
romance.
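The "too few/too many" reading of the graph can be written down as a simple rank difference. This is only a sketch: the function names and the `tolerance` of 10 rank places are arbitrary choices for illustration, not something taken from the data.

```python
def review_gap(sales_rank, review_rank):
    """Positive: more reviews than the sales rank suggests. Negative: fewer."""
    return sales_rank - review_rank


def describe(sales_rank, review_rank, tolerance=10):
    """Label a book by how far its review rank strays from its sales rank."""
    gap = review_gap(sales_rank, review_rank)
    if gap > tolerance:
        return "more reviews than expected"
    if gap < -tolerance:
        return "fewer reviews than expected"
    return "about as expected"


# The two example points described above
print(describe(sales_rank=30, review_rank=90))   # the Trad romance
print(describe(sales_rank=80, review_rank=15))   # the Indie romance
```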
We will try to delve a little deeper into whether there are
any patterns in this “too few/too many” dimension a little later, probably in
the next blog. We can use that to
uncover patterns of reader engagement, by price, genre, and so forth. Note that engagement isn’t necessarily the
same thing as satisfaction – someone can be passionate enough about a book to
engage (write a review), but not necessarily be happy about the book.
First, let’s look into whether there really is a relationship
here, or if the points are just random.
I calculated something called a rank correlation, then did something
called a simple hypothesis test. This is
standard statistical stuff, and it established to a very high certainty that
there really is a statistically strong relationship between sales rank and
review rank - higher ranked books are also higher ranked in reviews. By the way, the rank correlation was 0.615 and
the test statistic was 6.12, highly significant under the appropriate
statistical assumptions and test statistic distribution.
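For the curious, here is a rough sketch of how a rank correlation and a test statistic of this kind can be computed. The post does not say which test was used. The reported figures happen to be consistent with the large-sample normal approximation z = r * sqrt(n - 1) with n = 100 books, so that is what is assumed here. The function names are made up for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


def z_statistic(rho, n):
    """Large-sample test statistic under the 'no relationship' hypothesis."""
    return rho * math.sqrt(n - 1)


# Plugging in the figures reported above (n = 100 is an assumption)
print(round(z_statistic(0.615, 100), 2))
```

Under that assumption, a z value above about 2 would already be unusual if the ranks were unrelated, so a value above 6 is very strong evidence of a real relationship.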