"The Boathouse Christ", a new short story by Helena Puumala is now available on Amazon, for only 99 cents. It is a fascinating fusion of the sacred and the paranormal.
Also, for this weekend only, "Love at the Lake", a beautiful romantic short story, is free to download from Amazon.
Both stories are part of Helena's ongoing Lake Stories, a series of short stories connected by their setting and recurring characters, set at a Northern Ontario lake where strange and wonderful things happen.
She has written another Lake Story, which will be published on the Easter weekend, and another one which will be published in the summer. After that, well, no doubt inspiration will continue to strike, but nobody can say when. But you can count on her.
http://www.amazon.com/Boathouse-Christ-Helena-Puumala-ebook/dp/B00JBRD90Q/ref=sr_1_1?s=books&ie=UTF8&qid=1396112880&sr=1-1&keywords=The+boathouse+Christ
Saturday, 29 March 2014
Friday, 28 March 2014
Measuring the Top 100 Selling Kindle Books - Annual Sales vs Point-in-Time Snapshots
Amazon Top 100 Kindle Books, Indies versus Trads Sales Revisited Part 2 - Explaining how Daily Snapshots can Differ from Annual Rankings
As you can see, in this scenario, the average of the twelve
monthly snapshots is almost exactly the same as the cumulative annual
measure. That is, each month about 64%
of books in the top decile of the sales rankings were from the stickers, which
is also their percentage of the overall population of books. Their percentage of books in the top decile
in the annual rankings is also 64%.
In a recent blog, I did some comparisons of my analysis of
Amazon’s Top 100 Kindle eBooks of 2013, versus the data recently released by noted
SF writer Hugh Howie and his (currently unknown) data guru. They analysed a number of snapshot datasets, collected from Amazon’s website via a web
“spider”, which can data mine publicly available internet data extremely
quickly and efficiently. They have now
released datasets of increasing size (the latest included 50,000 books) and
have delved into books outside the genre categories. Those blogs of mine can be found under the
general titles “Amazon Top 100 Kindle Books” in the Dodecahedron Books blog
site. Hugh Howie’s can be found in the
website “Author Earnings”.
One key difference between my analysis of the Amazon Top 100
eBooks of 2013 and the Howie/DataGuru analysis concerned the proportions of traditionally
published books versus Indie books that were in the top 100.
Though my original analysis was surprising enough in its estimate of the
penetration of Indies in the Amazon best-sellers, the Howie/DataGuru data was
even more favourable to Indies. The
tables below recap those results, updating them with Howie/DataGuru’s most
recent findings. Here’s my result for percentage of Indie vs Trad books in the Top 100, along with the new results reported by Hugh Howie (next table).
| Amazon Top 100, 2013 | Total |
| --- | --- |
| Traditional | 76% |
| Indie | 24% |
| Grand Total | 100% |
These are Hugh/DataGuru’s numbers from the 50,000 book
sample. I have added his “From Small or
Medium Publisher”, “Big Five Published” and “Amazon Published” together, to be
equivalent to my “Traditional” category.
Similarly, I have added his “Indie Published” with “From Uncategorized
Single-Author Publisher” together to be equivalent to my “Indie” category.
| Hugh Howie’s Amazon snapshot, early 2014 | Total |
| --- | --- |
| Traditional | 64% |
| Indie | 36% |
| Grand Total | 100% |
Why are the results different? Why do Indies account for 36% of Hugh Howie’s
Feb 7, 2014 snapshot, but only 24% of the 2013 Amazon Top 100, by my count?
As I mentioned in an earlier blog, one possibility is simply
that a lot changed between the times that the two samples represent. To recap that blog:
“ My Amazon Top 100 analysis was
based on Amazon’s list of their top 100 books of 2013. In a sense then, it could be thought of as
representing the mid-point of the 2013 data, since it is an accumulation of
data collected throughout the year.
Hugh’s analysis was from a snapshot in February 2014…about 8 months
passed between the mid-point of one sample and the time of the second. In the current publishing world, a lot can
change in 8 months, as we know.”
I also noted a second possibility, which I will explore
below. To recap that blog:
“The second possibility is that
the traditionally published books in the top 100 were more consistently present
in that list over a longer time period, whereas any particular Indie book spends
less time in the top 100, to be replaced by a new Indie book… there is more
“churn” in the Indie books than the Trads….because the Trad authors have had
longer careers and therefore have a ready-made fan base that allows [any
particular trad title] to stick on the top of the list for a longer time. Indies
have a more experimental audience, so any particular book doesn’t stay at the
top as long, though as a group they are very successful.”
To explore this possibility, I constructed a model set of
200 books in Excel, which could be split into two groups:
· “Non-Stickers”, who sold between a lower and upper limit of copies each time period (a randomly generated number between 10 and 1000 per month).
· “Stickers”, who sold between a lower and upper limit of copies each time period, but had a slightly higher number for the lower limit, which could be varied (a randomly generated number between a variable lower limit and 1000 copies per month).
I then generated twelve months of artificial data, showing the percentage of books in the top decile of the sales rankings that were “Non-Stickers” each month versus the percentage that were “Stickers”. Note that the “Stickers” have a slight edge in book sales in the non-control scenarios, but only a slight edge. There were ten trials performed under each set of assumptions, to ensure that the random number generator resulted in a good representation of the underlying statistical assumptions (i.e. relying on the Law of Large Numbers, which simply means that as you do more trials your results will become closer and closer to the theoretical assumptions in your model).
The first two graphs show the results of having a dataset
of 64% “Stickers”/36% “Non-Stickers”, with each group randomly selling
somewhere between 10 and 1000 books per month.
I chose the 64/36 ratio because those are the proportions of Trads to
Indies in Hugh Howie’s dataset of 50,000 Amazon books. This is the control scenario, where Stickers
and non-Stickers sell the same number of books per month, on average. That would be 505 books each, the result of a
uniform random number generator, that picked a number between 10 and 1000 each
time, with each number having the same probability of being chosen.
I then varied the lower limit of books sold for the “Stickers”,
raising it slightly with each model run, while keeping it the same for the
“Non-Stickers”. The results of half a
dozen model runs are shown below, varying the lower limit each time. As you can see, the Howie/DataGuru results
are reproduced when the “Stickers” have a lower bound of about 60 sales per
month. That would imply an average of
about 530 books per month for the “Stickers”, versus the “Non-Stickers” average of 505 books per month. It corresponds to a difference that hardly
shows up in the monthly data, but is very noticeable in the annual data.
The exact numbers for the six model runs are shown below, along with a
graph of the results.
| Lower Bound | Upper Bound | Annual, Top Percentile | Monthly, Top Percentile |
| --- | --- | --- | --- |
| 10 | 1000 | 64% | 64% |
| 50 | 1000 | 68% | 65% |
| 62 | 1000 | 76% | 67% |
| 75 | 1000 | 79% | 67% |
| 100 | 1000 | 81% | 67% |
| 125 | 1000 | 90% | 67% |
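For readers who would rather not rebuild the spreadsheet, here is a rough Python sketch of the same kind of model. It is not the original Excel workbook: the function name `sticker_share`, its parameters, and the random tie-breaking in the rankings are my own assumptions, and "top of the rankings" is taken to mean the top 10% of the 200 books. Because the sales figures are random, the percentages it prints will vary from run to run and will not match the table exactly, though the general pattern should be similar.

```python
import random

def sticker_share(lower_sticker, trials=10, n_books=200, sticker_frac=0.64,
                  months=12, low=10, high=1000, top_frac=0.10, seed=None):
    """Return (average monthly share, average annual share) of Stickers
    among the top-selling books. All names and defaults are illustrative."""
    rng = random.Random(seed)
    n_stickers = round(n_books * sticker_frac)   # books 0..n_stickers-1 are Stickers
    top_n = round(n_books * top_frac)            # size of the top decile

    def share_in_top(sales):
        # Rank books by sales, breaking ties at random so neither group is favoured.
        ranked = sorted(range(n_books),
                        key=lambda i: (sales[i], rng.random()), reverse=True)
        return sum(1 for i in ranked[:top_n] if i < n_stickers) / top_n

    monthly, annual = [], []
    for _ in range(trials):
        totals = [0] * n_books
        for _ in range(months):
            # Stickers draw from [lower_sticker, high]; Non-Stickers from [low, high].
            sales = [rng.randint(lower_sticker if i < n_stickers else low, high)
                     for i in range(n_books)]
            monthly.append(share_in_top(sales))
            totals = [t + s for t, s in zip(totals, sales)]
        annual.append(share_in_top(totals))
    return sum(monthly) / len(monthly), sum(annual) / len(annual)

if __name__ == "__main__":
    for lower in (10, 50, 62, 75, 100, 125):
        m, a = sticker_share(lower, trials=10)
        print(f"lower bound {lower:>3}: monthly {m:.0%}, annual {a:.0%}")
```

The point of the sketch is the mechanism: a small monthly edge barely moves the monthly ranking, but it accumulates over twelve months and shows up strongly in the annual ranking.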
So, projecting these results into the Trad/Indie results,
it is clear that if the Trad published books
tended to be only a little more consistent in their monthly sales
results, they could quite easily have about 76% of the books in the Amazon Top
100 for the Year 2013, but only about 64% in a daily snapshot early in February
2014.
Obviously, this exercise doesn’t prove that this is what
happened, but it does show that it is quite plausible. Furthermore, if the “stickiness factor” isn’t
related to publisher category, but rather to length of time that a writer has
been in the public eye, then this Trad/Indie difference will wither away, as
Indies have more time to establish themselves in the marketplace.
Thursday, 20 March 2014
Measuring Luck
In the story “A Dark Horse”, in the collection Northern
Gothic Stories, Daniel Foster, a gambler, has a difficult time with lucky
streaks, bad and good. He worries that
perhaps something more than the vagaries of random chance is at work, perhaps
even something diabolical.
Anyway, a
while back, I was locked into the deepest losing streak I'd ever known, maybe
the deepest losing streak anyone has ever known. At least that's how it seemed
to me.
I'd been six straight weeks
without a winning day. Hell, at one point I'd gone fifty-seven straight races
without seeing the cashier's window. The
odds against that must be astronomical.
A dead man could do better. I mean, pure dumb luck ought to count for
something. At any rate, I was feeling pretty desperate.
…
My luck changed dramatically
and all for the better. It's been the
hottest winning streak I've ever had, for all I know it's been the hottest
winning streak of all time.
I don't know and at this
point I really don't care. You see, this streak has been too much, too unreal
for me to feel comfortable with. I like
to win. Every gambler likes to win. Hell, everyone likes to win, gambler or not.
But this thing - I don't know. I think I preferred the losing streak.
The thing that's really got
me are those damned dreams. Every night it's the same thing. They follow the
script, the one I described before. I
get up and go to the desk. The dark man
refuses to tell me his name. We make a bargain, and we seal it with a drink. As
I turn to leave, I hear the name of a horse.
The next day that horse is on the card.
If I bet the horse, he wins. If I don't bet he loses. I've made over $200,000 in the last month
alone, and I'm not even trying. A horse
player's dream, right? I'd give it all up for a good night's sleep.
I think we have all been through something like this in our
lives, whether or not it involved gambling.
It can be exhilarating to be on a hot streak, and devastating to be on a
cold streak. It feels like the universe
has singled you out, for better or worse.
So, just how do you measure how likely or unlikely a hot streak or cold
streak is?
First off, it helps to have a well-defined measure of success
or failure. In Daniel Foster’s case, it
was success at the track, which can be measured in percentage of wins and money
won or lost. In other cases, it can be
more nebulous - how do you measure “lucky in love”, for example?
So let’s stick with something easy. In this case, we will look at the 20 year
streak of ineptitude for Canadian NHL teams, in which they have not won a
single Stanley Cup (to be fair, it is really 19, after excluding the lockout
year). Is this just a streak of bad luck,
or is something else at work?
There are a number of ways to tackle this problem, so we will go in order, from “common sense” methods, to physical modelling (via playing cards), to computer modelling (via Excel), to theoretical mathematical methods (via the binomial distribution). Pick the one you understand the best and like the best.
We begin by looking at what percentage of NHL teams were Canadian
during this time span, which turns out to be 21.2% overall, varying from a high
of 23.3% to a low of 20%. So, naively,
we would expect a Canadian team to win the Stanley Cup every 4 or 5 years,
which corresponds to the proportion of teams in the league. So, 19 years does seem like a pretty long
time to go without a Stanley Cup. Our
naïve statistical sense tells us this is about 4 or 5 times longer than we
would expect. Carrying on further
with our naïve statistical instincts, we might say:
· There is roughly a 50% chance of a run of 5.
· So, there is roughly a 25% chance of a run of 10 (half of 50%).
· Then, there is roughly a 12.5% chance of a run of 15 (half of 25%).
· Giving a 6.25% chance of a run of 20 (half of 12.5%), more or less.
So, using our naïve probabilistic reasoning, we
think a run of 19 or 20 years without a Canadian team winning the Stanley Cup
is a pretty low likelihood event (at about 6.25%), but not alarmingly so. This reasoning isn’t actually valid, but I think it gives a feel for how people reason about things like this. Plus, it does give an answer that is accurate, to a first approximation. It tells us that we wouldn’t expect a run like this very often.
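The arithmetic behind this naïve estimate can be written out in a few lines of Python. This is only a sketch of the rule of thumb described above (halve the chance for every extra 5 years), not a valid probability calculation; the function names `expected_gap` and `naive_run_chance` are made up for illustration.

```python
def expected_gap(p_canadian):
    """Average number of years between Cups if the chance each year is p_canadian."""
    return 1.0 / p_canadian

def naive_run_chance(years, halving_period=5):
    """Rule of thumb from the text: the chance halves for every 5 more years."""
    return 0.5 ** (years / halving_period)

if __name__ == "__main__":
    # 21.2% of teams being Canadian gives a Cup every 4 to 5 years, on average.
    print(f"expected gap: {expected_gap(0.212):.1f} years")
    for years in (5, 10, 15, 20):
        print(f"run of {years:>2}: {naive_run_chance(years):.2%}")
```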
What’s our next effort to figure this out?
This time, let’s try an experiment, using a real-world
situation to model our problem. To do
so, I took a deck of playing cards, and let the suit Clubs represent Canadian
hockey teams (Montreal’s team is called the Canadiens’ Hockey Club, so I
thought it appropriate to let Clubs represent the Canadian hockey clubs). There are 13 Clubs in a deck of 52 cards, so
they represent 25% of the deck. If we
remove the King and Queen of Clubs, then that suit has 11 cards out of the 50
remaining, representing 22% of the cards in the reduced deck. That’s as close as we can get to the 21.2% of
Canadian teams in the NHL during the period in question, so we will go with that.
So, now we take our modified deck of playing cards, and
simulate the hockey problem by:
· Shuffling the deck thoroughly.
· Dealing cards out until we come to a Club, counting the number of cards as we do so.
· Recording the length of the run of non-Clubs.
· Repeating this as many times as you like (I did 100 trials).
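If you don't have a deck of cards handy, the steps above can be mimicked in Python. This is a sketch under my own assumptions: the function names are invented, and the run length is taken to be the number of cards dealt up to and including the first Club (so a Club on the very first card is a run of 1). Being random, each run of the script will give a somewhat different table.

```python
import random
from collections import Counter

def deal_until_club(rng, clubs=11, others=39):
    """Shuffle the reduced 50-card deck and count cards dealt until the first Club."""
    deck = ["club"] * clubs + ["other"] * others
    rng.shuffle(deck)
    return deck.index("club") + 1   # position of the first Club, counting from 1

def run_length_table(trials=100, seed=None):
    """Tally run lengths over a number of shuffle-and-deal trials."""
    rng = random.Random(seed)
    return Counter(deal_until_club(rng) for _ in range(trials))

if __name__ == "__main__":
    table = run_length_table(trials=100)
    for length in sorted(table):
        print(f"{length:>3} | {table[length]:>3} | {table[length] / 100:.0%}")
```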
My results are given below:
| Run Length | Frequency | Percent |
| --- | --- | --- |
| 1 | 20 | 20% |
| 2 | 18 | 18% |
| 3 | 6 | 6% |
| 4 | 13 | 13% |
| 5 | 10 | 10% |
| 6 | 5 | 5% |
| 7 | 8 | 8% |
| 8 | 3 | 3% |
| 9 | 4 | 4% |
| 10 | 3 | 3% |
| 11 | 3 | 3% |
| 12 | 0 | 0% |
| 13 | 3 | 3% |
| 14 | 1 | 1% |
| 15 | 1 | 1% |
| 16 | 1 | 1% |
| 17 | 0 | 0% |
| 18 | 1 | 1% |
| 19 | 0 | 0% |
| 20 | 0 | 0% |
| 20+ | 0 | 0% |
| Total | 100 | |
As you can see, the longest run was 18, not quite as long as
the number of years that Canadian teams have gone without winning a Stanley
Cup. So, it would appear that a run this
long is less likely than our naïve statistical sense told us. In fact, it appears that a run of 19 or 20
has a likelihood of coming up about once every 100 trials, at best.
For the heck of it, I tried this again, though this time I
didn’t re-shuffle after hitting a Club, but dealt the deck on to
exhaustion. Basically, this was faster,
though it wasn’t as statistically rigorous, since in any one shuffled deck, the
short runs and long runs would be anti-correlated (sort of like the idea behind card counting
in blackjack). Anyway, here are those
results, this time using 250 trials:
| Run Length | Frequency | Percent |
| --- | --- | --- |
| 1 | 47 | 19% |
| 2 | 56 | 22% |
| 3 | 27 | 11% |
| 4 | 28 | 11% |
| 5 | 25 | 10% |
| 6 | 14 | 6% |
| 7 | 13 | 5% |
| 8 | 7 | 3% |
| 9 | 6 | 2% |
| 10 | 6 | 2% |
| 11 | 3 | 1% |
| 12 | 3 | 1% |
| 13 | 5 | 2% |
| 14 | 2 | 1% |
| 15 | 1 | 0% |
| 16 | 1 | 0% |
| 17 | 2 | 1% |
| 18 | 2 | 1% |
| 19 | 1 | 0% |
| 20 | 0 | 0% |
| 20+ | 1 | 0% |
| Total | 250 | |
This time we hit one run of 19 and one run of 20+, so that’s
2 out of 250, or a little under 1%. Surprisingly
enough, the longest run was a run of 29.
It’s always weird to witness such a low probability event happen
before your eyes, even if it means very little in the real world.
So, I think we can safely say from the experimental
evidence, that a run of 19 years without a Canadian team winning the Stanley
Cup is a pretty unusual event (odds are
on the order of 1%), if it is just a matter of random chance.
I also set up a Monte Carlo simulation in Excel. In that one, I simulated a 20,000 year NHL
history (yes, it is a bit excessive), and counted how many times a run of 19 or
more came up by random chance, given a Canadian team representation of 21.2% of
all teams. In 100 trials of this simulation,
the median percentage of runs of that length was a bit under 1.0%. That conforms nicely to our playing card
experiment.
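For anyone who wants to try this without Excel, here is a rough Python equivalent. It is a sketch, not a copy of my spreadsheet: the function names are invented, and I am assuming that a "run" means the number of Cup-less seasons between one Canadian win and the next (zero if a Canadian team wins two years in a row), with any unfinished run at the end of the history ignored. A different way of counting runs would shift the percentage somewhat.

```python
import random
from statistics import median

def drought_lengths(seasons, p_canadian, rng):
    """Simulate a league history; return the Cup-less runs between Canadian wins."""
    droughts, current = [], 0
    for _ in range(seasons):
        if rng.random() < p_canadian:   # a Canadian team wins this season
            droughts.append(current)
            current = 0
        else:
            current += 1
    return droughts                     # the unfinished final run is dropped

def share_of_long_droughts(seasons=20_000, p_canadian=0.212, min_len=19, rng=None):
    """Fraction of runs that lasted at least min_len seasons."""
    rng = rng or random.Random()
    droughts = drought_lengths(seasons, p_canadian, rng)
    return sum(d >= min_len for d in droughts) / len(droughts)

def median_share(trials=100, seed=None, **kwargs):
    """Median of the long-run fraction over several simulated histories."""
    rng = random.Random(seed)
    return median(share_of_long_droughts(rng=rng, **kwargs) for _ in range(trials))

if __name__ == "__main__":
    print(f"median share of runs of 19+: {median_share(trials=100):.2%}")
```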
Finally, we can look at this as a simple binomial
distribution probability problem (you may remember this from high school or
university math courses), with p=.21 and n=19, and consult a table like this
one:
Doing that, we also find the probability of a 19 year run of
no Canadian Stanley Cups to be about 1%.
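If you don't have a binomial table at hand, the same figure can be computed directly. The short Python sketch below uses the standard binomial formula; the function name `binomial_pmf` is just an illustrative choice. We want the probability of zero Canadian wins (k=0) in n=19 independent seasons, each with a p=.21 chance of a Canadian winner.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    # Zero Canadian Cups in 19 seasons; with k=0 this reduces to (1 - p)**19.
    print(f"{binomial_pmf(0, 19, 0.21):.2%}")
```

With k=0 the formula collapses to 0.79 multiplied by itself 19 times, which comes out at a little over 1%.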
The nice thing about the playing card simulation, is that it
is easy to understand and anyone can do it if they choose to spend half an hour
or so shuffling and dealing cards (perhaps while watching their favorite
Canadian hockey team scrub out of the playoffs). You don’t need a strong math background, just
common sense.
So, have we discovered whether the lack of success of
Canadian hockey teams is just one of those things, or is there something deeper
at work? I suppose it is still a
judgement call, but it does make you wonder.
Every year this goes on, makes you wonder even more. We’ll leave it at that for now.