Dodecahedron Books: March 2014

Saturday, 29 March 2014

The Boathouse Christ, a short story by Helena Puumala, now on Amazon

"The Boathouse Christ", a new short story by Helena Puumala, is now available on Amazon for only 99 cents. It is a fascinating fusion of the sacred and the paranormal.

Also, for this weekend only, "Love at the Lake", a beautiful romantic short story, is free to download from Amazon.

Both stories are part of Helena's ongoing Lake Stories, a series of short stories connected by their setting and recurring characters, set at a Northern Ontario lake where strange and wonderful things happen.

She has written another Lake Story, which will be published on the Easter weekend, and another one which will be published in the summer. After that, well, no doubt inspiration will continue to strike, but nobody can say when. But you can count on her.

http://www.amazon.com/Boathouse-Christ-Helena-Puumala-ebook/dp/B00JBRD90Q/ref=sr_1_1?s=books&ie=UTF8&qid=1396112880&sr=1-1&keywords=The+boathouse+Christ

Friday, 28 March 2014

Measuring the Top 100 Selling Kindle Books - Annual Sales vs Point-in-Time Snapshots

Amazon Top 100 Kindle Books, Indies versus Trads Sales Revisited Part 2 - Explaining how Daily Snapshots can Differ from Annual Rankings

In a recent blog, I did some comparisons of my analysis of Amazon’s Top 100 Kindle eBooks of 2013, versus the data recently released by noted SF writer Hugh Howie and his (currently unknown) data guru. They analysed a number of snapshot datasets, collected from Amazon’s website via a web “spider”, which can mine publicly available internet data extremely quickly and efficiently. They have now released datasets of increasing size (the latest included 50,000 books) and have delved into books outside the genre categories. Those blogs of mine can be found under the general title “Amazon Top 100 Kindle Books” on the Dodecahedron Books blog site. Hugh Howie’s can be found on the website “Author Earnings”.

One key difference between my analysis of the Amazon Top 100 eBooks of 2013 and the Howie/DataGuru analysis concerned the proportions of traditionally published books versus Indie books that were in the top 100. Though my original analysis was surprising enough in its estimate of the penetration of Indies in the Amazon best-sellers, the Howie/DataGuru data was even more favourable to Indies. The tables below recap those results, updating them with Howie/DataGuru’s most recent findings.

Here’s my result for percentage of Indie vs Trad books in the Top 100, along with the new results reported by Hugh Howie (next table).

Amazon Top 100, 2013	Total
Traditional	76%
Indie	24%
Grand Total	100%

These are Hugh/DataGuru’s numbers from the 50,000 book sample. I have added his “From Small or Medium Publisher”, “Big Five Published” and “Amazon Published” together, to be equivalent to my “Traditional” category. Similarly, I have added his “Indie Published” with “From Uncategorized Single-Author Publisher” together to be equivalent to my “Indie” category.

Hugh Howie’s Amazon snapshot, early 2014	Total
Traditional	64%
Indie	36%
Grand Total	100%

Why are the results different? Why do Indies account for 36% of Hugh Howie’s Feb 7, 2014 snapshot, but only 24% of the 2013 Amazon Top 100, by my count?

As I mentioned in an earlier blog, one possibility is simply that a lot changed between the times that the two samples represent. To recap that blog:

“My Amazon Top 100 analysis was based on Amazon’s list of their top 100 books of 2013. In a sense then, it could be thought of as representing the mid-point of the 2013 data, since it is an accumulation of data collected throughout the year. Hugh’s analysis was from a snapshot in February 2014…about 8 months passed between the mid-point of one sample and the time of the second. In the current publishing world, a lot can change in 8 months, as we know.”

I also noted a second possibility, which I will explore below. To recap that blog:

“The second possibility is that the traditionally published books in the top 100 were more consistently present in that list over a longer time period, whereas any particular Indie book spends less time in the top 100, to be replaced by a new Indie book… there is more “churn” in the Indie books than the Trads….because the Trad authors have had longer careers and therefore have a ready-made fan base that allows [any particular trad title] to stick on the top of the list for a longer time. Indies have a more experimental audience, so any particular book doesn’t stay at the top as long, though as a group they are very successful.”

To explore this possibility, I constructed a model set of 200 books in Excel, which could be split into two groups:

· “Non-Stickers”, who sold between a lower and upper limit of copies of books each time period (a randomly generated number, between 10 and 1000 per month).

· “Stickers”, who sold between a lower and upper limit of copies of books each time period, but had a slightly higher number for the lower limit, which could be varied (a randomly generated number between a variable lower limit and 1000 copies per month).

I then generated twelve months of artificial data, showing the percentage of books that were “Non-Stickers” each month versus the percentage that were “Stickers”. Note that the “Stickers” have a slight edge in book sales in the non-control scenarios, but only a slight edge. There were ten trials performed under each set of assumptions, so that the random number generator would give a good representation of the underlying statistical assumptions (i.e. relying on the Law of Large Numbers, which simply means that as you do more trials your average results will become closer and closer to the theoretical values in your model).

The first two graphs show the results of having a dataset of 64% “Stickers”/36% “Non-Stickers”, with each group randomly selling somewhere between 10 and 1000 books per month. I chose the 64/36 ratio because that is the proportion of Trads to Indies in Hugh Howie’s dataset of 50,000 Amazon books. This is the control scenario, where Stickers and Non-Stickers sell the same number of books per month, on average. That would be 505 books each (the midpoint of 10 and 1000), the result of a uniform random number generator that picked a number between 10 and 1000 each time, with each number having the same probability of being chosen.

As you can see, in this scenario, the average of the twelve monthly snapshots is almost exactly the same as the cumulative annual measure. That is, each month about 64% of books in the top decile of the sales rankings were from the stickers, which is also their percentage of the overall population of books. Their percentage of books in the top decile in the annual rankings is also 64%.

I then varied the lower limit of books sold for the “Stickers”, raising it slightly with each model run, while keeping it the same for the “Non-Stickers”. The results of half a dozen model runs are shown below, varying the lower limit each time. As you can see, the Howie/DataGuru results are reproduced when the “Stickers” have a lower bound of about 60 sales per month. That would imply an average of about 530 books per month, versus the Indies’ average of 505 books per month. It corresponds to a difference that hardly shows up in the monthly data, but is very noticeable in the annual data.

The exact numbers for the six model runs are shown below, along with a graph of the results.

Stickers’ Lower Bound	Upper Bound	Stickers’ Share of Top Decile, Annual	Stickers’ Share of Top Decile, Monthly
10	1000	64%	64%
50	1000	68%	65%
62	1000	76%	67%
75	1000	79%	67%
100	1000	81%	67%
125	1000	90%	67%
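
For readers who would rather not build this in Excel, here is a minimal Python sketch of the same kind of model. It is not the original spreadsheet: the function names, the use of Python's random module, the random tie-breaking, and the choice to rank the 200 books by sales and take the top 20 (the top decile) are all assumptions made for illustration.

```python
import random

def sticker_shares(sticker_low, n_books=200, sticker_frac=0.64,
                   months=12, low=10, high=1000, rng=None):
    """Run one 12-month trial; return the Stickers' share of the top decile
    as (annual ranking, average of the monthly rankings)."""
    rng = rng or random.Random()
    n_stickers = round(n_books * sticker_frac)   # books 0..n_stickers-1 are Stickers
    top_n = n_books // 10                        # size of the top decile

    def share(sales):
        # Rank books by sales and see how many of the top decile are Stickers.
        order = list(range(n_books))
        rng.shuffle(order)                       # break ties at random, not by index
        order.sort(key=lambda i: sales[i], reverse=True)
        return sum(i < n_stickers for i in order[:top_n]) / top_n

    annual = [0] * n_books
    monthly_shares = []
    for _ in range(months):
        # Uniform monthly sales; only the Stickers' lower limit differs.
        month = [rng.randint(sticker_low if i < n_stickers else low, high)
                 for i in range(n_books)]
        monthly_shares.append(share(month))
        annual = [a + m for a, m in zip(annual, month)]
    return share(annual), sum(monthly_shares) / months


def average_shares(sticker_low, trials=10, seed=None):
    """Average the annual and monthly shares over several independent trials."""
    rng = random.Random(seed)
    results = [sticker_shares(sticker_low, rng=rng) for _ in range(trials)]
    annual = sum(r[0] for r in results) / trials
    monthly = sum(r[1] for r in results) / trials
    return annual, monthly


if __name__ == "__main__":
    for bound in (10, 50, 62, 75, 100, 125):
        annual, monthly = average_shares(bound)
        print(f"lower bound {bound:>3}: annual {annual:.0%}, monthly {monthly:.0%}")
```

Because the draws are random, each run gives slightly different percentages, and ten trials per lower bound still leaves some noise. Raising the trials argument smooths the results out.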

So, projecting these results into the Trad/Indie results, it is clear that if the Trad published books tended to be only a little more consistent in their monthly sales results, they could quite easily have about 76% of the books in the Amazon Top 100 for the Year 2013, but only about 64% in a daily snapshot early in February 2014.

Obviously, this exercise doesn’t prove that this is what happened, but it does show that it is quite plausible. Furthermore, if the “stickiness factor” isn’t related to publisher category, but rather to length of time that a writer has been in the public eye, then this Trad/Indie difference will wither away, as Indies have more time to establish themselves in the marketplace.

Thursday, 20 March 2014

Measuring Luck

In the story “A Dark Horse”, in the collection Northern Gothic Stories, Daniel Foster, a gambler, has a difficult time with lucky streaks, bad and good. He worries that perhaps something more than the vagaries of random chance are at work, perhaps even something diabolical.

Anyway, a while back, I was locked into the deepest losing streak I'd ever known, maybe the deepest losing streak anyone has ever known. At least that's how it seemed to me.

I'd been six straight weeks without a winning day. Hell, at one point I'd gone fifty-seven straight races without seeing the cashier's window. The odds against that must be astronomical. A dead man could do better. I mean, pure dumb luck ought to count for something. At any rate, I was feeling pretty desperate.

…

My luck changed dramatically and all for the better. It's been the hottest winning streak I've ever had, for all I know it's been the hottest winning streak of all time.

I don't know and at this point I really don't care. You see, this streak has been too much, too unreal for me to feel comfortable with. I like to win. Every gambler likes to win. Hell, everyone likes to win, gambler or not. But this thing - I don't know. I think I preferred the losing streak.

The thing that's really got me are those damned dreams. Every night it's the same thing. They follow the script, the one I described before. I get up and go to the desk. The dark man refuses to tell me his name. We make a bargain, and we seal it with a drink. As I turn to leave, I hear the name of a horse. The next day that horse is on the card. If I bet the horse, he wins. If I don't bet he loses. I've made over $200,000 in the last month alone, and I'm not even trying. A horse player's dream, right? I'd give it all up for a good night's sleep.

http://www.amazon.com/Northern-Gothic-Stories-Dale-Olausen-ebook/dp/B00AQT8IJ0

I think we have all been through something like this in our lives, whether or not it involved gambling. It can be exhilarating to be on a hot streak, and devastating to be on a cold streak. It feels like the universe has singled you out, for better or worse. So, just how do you measure how likely or unlikely a hot streak or cold streak is?

First off, it helps to have a well-defined measure of success or failure. In Daniel Foster’s case, it was success at the track, which can be measured in percentage of wins and money won or lost. In other cases, it can be more nebulous - how do you measure “lucky in love”, for example?

So let’s stick with something easy. In this case, we will look at the 20-year streak of ineptitude for Canadian NHL teams, in which they have not won a single Stanley Cup (to be fair, it is really 19, after excluding the lockout year). Is this just a streak of bad luck, or is something else at work?

There are a number of ways to tackle this problem, so we will go in order, from “common sense” methods, to physical modelling (via playing cards), to computer modelling (via Excel), to theoretical mathematical methods (via the binomial distribution). Pick the one you understand the best and like the best.

We begin by looking at what percentage of NHL teams were Canadian during this time span, which turns out to be 21.2% overall, varying from a high of 23.3% to a low of 20%. So, naively, we would expect a Canadian team to win the Stanley Cup every 4 or 5 years, which corresponds to the proportion of teams in the league. So, 19 years does seem like a pretty long time to go without a Stanley Cup. Our naïve statistical sense tells us this is about 4 or 5 times longer than we would expect. Carrying on further with our naïve statistical instincts, we might say:

· There is roughly a 50% chance of a run of 5.

· So, there is roughly a 25% chance of a run of 10 (half of 50%).

· Then, there is roughly a 12.5% chance of a run of 15 (half of 25%).

· Giving a 6.25% chance of a run of 20 (half of 12.5%), more or less.

So, using our naïve probabilistic reasoning, we think a run of 19 or 20 years without a Canadian team winning the Stanley Cup is a pretty low likelihood event (at about 6.25%), but not alarmingly so.

This reasoning isn’t actually valid, but I think it gives a feel for how people reason about things like this. Plus, it does give an answer that is accurate, to a first approximation. It tells us that we wouldn’t expect a run like this very often.

What’s our next effort to figure this out?

This time, let’s try an experiment, using a real-world situation to model our problem. To do so, I took a deck of playing cards, and let the suit Clubs represent Canadian hockey teams (Montreal’s team is called the Canadiens’ Hockey Club, so I thought it appropriate to let Clubs represent the Canadian hockey clubs). There are 13 Clubs in a deck of 52 cards, so they represent 25% of the deck. If we remove the King and Queen of Clubs, then that suit has 11 cards out of the 50 remaining, representing 22% of the cards in the reduced deck. That’s as close as we can get to the 21.2% of Canadian teams in the NHL during the period in question, so we will go with that.

So, now we take our modified deck of playing cards, and simulate the hockey problem by:

· Shuffling the deck thoroughly.

· Dealing cards out until we come to a Club, counting the number of cards as we do so.

· Recording the length of the run of non-Clubs.

· Repeating this as many times as you like (I did 100 trials).

My results are given below:

Run Length	Frequency	Percent
1	20	20%
2	18	18%
3	6	6%
4	13	13%
5	10	10%
6	5	5%
7	8	8%
8	3	3%
9	4	4%
10	3	3%
11	3	3%
12	0	0%
13	3	3%
14	1	1%
15	1	1%
16	1	1%
17	0	0%
18	1	1%
19	0	0%
20	0	0%
20+	0	0%
Total	100

As you can see, the longest run was 18, not quite as long as the number of years that Canadian teams have gone without winning a Stanley Cup. So, it would appear that a run this long is less likely than our naïve statistical sense told us. In fact, it appears that a run of 19 or 20 has a likelihood of coming up about once every 100 trials, at best.
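
If you don't have a deck of cards handy, the dealing experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up here, and the run length is taken to be the number of cards dealt up to and including the first Club (so a Club on the very first card counts as a run of 1, which is how the table above appears to be counted).

```python
import random
from collections import Counter

N_CLUBS, N_OTHERS = 11, 39   # the 50-card deck with two Clubs removed

def cards_until_club(rng):
    """Shuffle the reduced deck and count the cards dealt up to and
    including the first Club."""
    deck = [True] * N_CLUBS + [False] * N_OTHERS   # True marks a Club
    rng.shuffle(deck)
    return deck.index(True) + 1

def run_trials(n_trials=100, seed=None):
    """Tally how often each run length comes up over n_trials fresh shuffles."""
    rng = random.Random(seed)
    return Counter(cards_until_club(rng) for _ in range(n_trials))

if __name__ == "__main__":
    trials = 100
    tally = run_trials(trials)
    print("Run Length\tFrequency\tPercent")
    for length in sorted(tally):
        print(f"{length}\t{tally[length]}\t{tally[length] / trials:.0%}")
    long_runs = sum(count for length, count in tally.items() if length >= 19)
    print(f"Runs of 19 or more: {long_runs} of {trials}")
```

With only 100 trials the long runs will come and go from one attempt to the next, just as they do with real cards. Passing a larger n_trials gives a steadier picture of how rare they are.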

For the heck of it, I tried this again, though this time I didn’t re-shuffle after hitting a Club, but dealt the deck on to exhaustion. Basically, this was faster, though it wasn’t as statistically rigorous, since in any one shuffled deck, the short runs and long runs would be anti-correlated (sort of like the idea behind card counting in blackjack). Anyway, here are those results, this time using 250 trials:

Run Length	Frequency	Percent
1	47	19%
2	56	22%
3	27	11%
4	28	11%
5	25	10%
6	14	6%
7	13	5%
8	7	3%
9	6	2%
10	6	2%
11	3	1%
12	3	1%
13	5	2%
14	2	1%
15	1	0%
16	1	0%
17	2	1%
18	2	1%
19	1	0%
20	0	0%
20+	1	0%
Total	250

This time we hit one run of 19 and one run of 20+, so that’s 2 out of 250, or a little under 1%. Surprisingly enough, the longest run was a run of 29. It’s always weird to witness such a low probability event happen before your eyes, even if it means very little in the real world.

So, I think we can safely say, from the experimental evidence, that a run of 19 years without a Canadian team winning the Stanley Cup is a pretty unusual event (odds are on the order of 1%), if it is just a matter of random chance.

I also set up a Monte Carlo simulation in Excel. In that one, I simulated a 20,000-year NHL history (yes, it is a bit excessive), and counted how many times a run of 19 or more came up from random chance, given a Canadian team representation of 21.2% of all teams. In 100 trials of this simulation, the median percentage of runs of that length was a bit under 1.0%. That agrees nicely with our playing card experiment.
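
A rough Python version of this kind of Monte Carlo simulation is sketched below. It is not the Excel workbook itself. In particular, the way a "run" is counted here is an assumption: each Canadian win closes off a run of non-Canadian years (possibly of length zero), and the result is the share of those runs that lasted 19 years or more. A different counting rule would shift the percentage somewhat. The function names are also made up for illustration.

```python
import random
from statistics import median

def long_run_share(n_years=20_000, p_canadian=0.212, min_run=19, rng=None):
    """Simulate n_years of Cup winners and return the share of completed
    non-Canadian runs that lasted at least min_run years."""
    rng = rng or random.Random()
    runs, current = [], 0
    for _ in range(n_years):
        if rng.random() < p_canadian:   # a Canadian team wins this year
            runs.append(current)        # close off the run that just ended
            current = 0
        else:
            current += 1                # the drought continues
    # Any unfinished run at the end of the history is ignored.
    if not runs:
        return 0.0
    return sum(r >= min_run for r in runs) / len(runs)

def median_share(trials=100, seed=None, **kwargs):
    """Median of the long-run share over several simulated histories."""
    rng = random.Random(seed)
    return median(long_run_share(rng=rng, **kwargs) for _ in range(trials))

if __name__ == "__main__":
    print(f"Median share of runs of 19+ years: {median_share():.2%}")
```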

Finally, we can look at this as a simple binomial distribution probability problem (you may remember this from high school or university math courses), with p=.21 and n=19, and consult a table like this one:

http://www.utstat.toronto.edu/~olgac/stab22_Winter_2014/datasets/Binomialtable.pdf

Doing that, we also find the probability of a 19-year run of no Canadian Stanley Cups (that is, zero successes in 19 tries) to be about 1%.
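
If you would rather calculate than look it up in a table, the binomial probability is easy to compute directly. The short Python sketch below uses the standard binomial formula; the function name is just an illustrative choice, and the second value of p (0.212) is the overall share of Canadian teams quoted earlier, included for comparison.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent tries,
    each with success chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    for p in (0.21, 0.212):
        chance = binomial_pmf(0, 19, p)
        print(f"p = {p}: chance of no Canadian Cups in 19 years = {chance:.4f}")
```

With k = 0 the formula boils down to (1 - p) multiplied by itself 19 times, which comes out at a little over 1%.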

The nice thing about the playing card simulation is that it is easy to understand and anyone can do it if they choose to spend half an hour or so shuffling and dealing cards (perhaps while watching their favorite Canadian hockey team scrub out of the playoffs). You don’t need a strong math background, just common sense.

So, have we discovered whether the lack of success of Canadian hockey teams is just one of those things, or is there something deeper at work? I suppose it is still a judgement call, but it does make you wonder. Every year this goes on makes you wonder even more. We’ll leave it at that for now.