Saturday, 22 February 2020

Estimating the Corona Virus (Covid-19) Transmission Rate, from the Diamond Princess Data



Since the outbreak of the Novel Coronavirus 2019-nCoV (now called Covid-19), there have been numerous questions concerning the nature and effects of the virus.  One of the key questions, obviously of great interest to everybody, is just how fast it can spread.

The cruise ship, Diamond Princess, presents us with a “natural experiment”, that can provide some clues.  We have good data on cases per day, which should be quite reliable, as the crew and passengers were under constant scrutiny, as to whether they were showing symptoms, and were being tested quite soon thereafter.  Given the attention of the world on the situation, it seems probable that the data would have been accurately reported.

It is worth noting that the Diamond Princess case is different from the virus in the wild, so to speak, for a number of reasons:


  • Great efforts were being made to prevent the spread of the virus within the ship, as it remained in harbor, with strict rules about access to the ship, travel within the ship and contact between people on board (passengers and crew).  That is, after all, what is meant by quarantine.  Granted, there are questions about just how effective those efforts were, but that should have reduced risk of transmission, at least in theory.

  • Conversely, though, keeping all of these people in close quarters, with a virulent virus on board, presents a greater than average risk of any particular person in this limited population coming into contact with the virus.  Even though great efforts were presumably being made to enforce the quarantine, the situation was quite favourable, from the virus’s point of view.  As has been said, it was a sort of giant petri dish.

Given those facts, it is hard to say how generalizable the data on spreading is to other conditions, such as an urban environment.  Nonetheless, it is worth examining and learning what lessons we can, keeping in mind these caveats.  

Here are some graphs showing the progress of the disease, in numbers of cases, while the ship was in quarantine.  The graphs show the numbers of cases reported, a best-fit line that gives an idea of the underlying mathematical function, the statistical properties of that best-fit function (equation and R-square) and a projection of future cases predicted by the function, had the situation remained unchanged over the next couple of weeks.  Note that this data is publicly available, including on the ship’s website.

I will present these graphs in ascending order of their R-square.  This is a statistical measure that gives an idea of how closely the data fits the best-fit function; the higher the R-square, the better the data fits the functional form.  An R-square of 1 is a perfect fit (i.e. no error-term between the actual data and what would be predicted by the functional form).  Any R-square close to 1 is a good fit, though just how good is a bit of a judgement call.
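For readers who want to see how an R-square comes about, here is a minimal sketch of the calculation for a straight-line fit.  The day and case numbers in it are hypothetical values made up for illustration, not the Diamond Princess counts, and the function names are my own.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, predicted):
    """R-square = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical illustrative counts, NOT the Diamond Princess data.
days = [1, 2, 3, 4, 5, 6]
cases = [2, 5, 9, 16, 24, 37]

slope, intercept = linear_fit(days, cases)
predicted = [slope * d + intercept for d in days]
print(f"R-square of the straight-line fit: {r_squared(cases, predicted):.3f}")
```

Because the made-up counts curve upward, the straight line leaves some error and the R-square comes out below 1; data lying exactly on a line would give exactly 1.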


Case 1 – Linear Relationship

This is the best case scenario, where the spread of the virus is slowest.  The graph shows a relatively slow but steady increase in cases.  In this scenario, the number of cases would remain below 1000 until two more weeks had passed.

Though the actual data (the blue points) lie relatively close to the line, the fit doesn’t look that great.  The R-square is fairly high at 0.855, though.




This relationship seems fairly unlikely on theoretical grounds.  It implies that the same number of new people becomes infected every day.  That could happen, but the rate of transmission from person to person would have to be quite low.  However, if the amount of time an infected person was infectious to others was rather short and contact was limited, something like this might prevail.  Basically, although the number of people who had been infected would grow, the number that were actually infectious at any point in time, and therefore could spread the disease, would remain stable.


Case 2 – Exponential Relationship

This is probably the worst case scenario, where the spread of the virus is the most rapid.  The best fit line on the graph shows a rapid rise in cases, with the numbers exploding to infinity, as it is often said.  In fact, based on this functional relationship, everyone on board the ship would be expected to fall ill before another week was out (the ship’s complement of passengers and crew totalled 3711 people).

Again, the actual data (the blue points) lie reasonably close to the line, but the visual impression of the fit isn’t that great, especially the lack of fit in the last three points.  At 0.859, the R-square is nearly identical to that of the linear relationship.



On theoretical grounds, a pandemic can grow at an exponential rate, at least for a while.  If each infected person can infect several people, the pandemic can grow at a very rapid clip.  Of course no function in the real world can remain exponential for too long: in the case of a virus it will eventually run out of hosts as it grows, as the hosts either develop immunity or succumb to the disease and die.  Eventually, the function must turn downward.

Case 3 – Quadratic Relationship

The quadratic form is a second order polynomial, which can also indicate rapid growth for an underlying phenomenon: faster than the linear model, but slower than the exponential.  This function predicts that about 2000 people would be infected within two weeks, over half the people on the ship.

In this case, the data points appear to fit the function very well, with some points slightly below the line and some slightly above (technically it is not heteroscedastic, as the exponential was).  The R-square is very high at 0.964.



As stated earlier, the quadratic model indicates that there is both a linear trend, and a second order trend to the data.  The latter means that the rate of change is accelerating, so to speak, as the days go on.

Like the exponential, a second order relationship can only go on for so long in the real world.  It will eventually be bounded by real world constraints.


Case 4 – Power Law Relationship

The quadratic form is a special case of a power law, where the exponent is equal to the integer 2.  The exploratory power law model shown below is very close to the quadratic model.  It would also have about half of the people on the ship sick within a couple more weeks.



In this case, the data points also appear to fit the function very well, closely resembling the quadratic.  The R-square is slightly higher, at 0.988.
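One common way to fit a power law is to take logarithms of both axes, which turns y = a * x^k into the straight line ln(y) = ln(a) + k * ln(x), and then fit that line by least squares.  The sketch below does this; I can't say it is exactly what the graphing software does, and the counts in it are hypothetical values for illustration, not the ship's data.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares on ln(y) = ln(a) + k * ln(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    k = sxy / sxx                       # exponent of the power law
    a = math.exp(mean_y - k * mean_x)   # multiplicative constant
    return a, k


# Hypothetical illustrative counts, NOT the Diamond Princess data.
days = [1, 2, 3, 4, 5, 6, 7, 8]
cases = [3, 11, 26, 45, 70, 105, 140, 185]

a, k = fit_power_law(days, cases)
print(f"fitted model: cases = {a:.2f} * day^{k:.2f}")

# Projection two weeks past the last observed day, had nothing changed.
print(f"projected cases on day 22: {a * 22 ** k:.0f}")
```

An exponent k near 2 would put the fit close to the quadratic case; an exponent near 1 would bring it back toward the linear one.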



The Death Rate on Diamond Princess

At the time that passengers were being transferred to other locations, there were only 2 deaths among the 634 cases.  We don’t know for sure when those deaths occurred, but will make the assumption that they were on the last day (Feb 20, 20 days after the start of the outbreak, which we will place on Feb 1).

Given those parameters, we can calculate a rough death rate, assuming different lag times for median time from diagnosis to death.  Doing that gives the graph below, indicating a death rate of somewhere between 2 and 5%, assuming that the latency period is between a week and two weeks.  These are admittedly rough figures, but they correspond fairly well with the experience in China, which gives a fatality rate of about 5 to 6 percent, using a similar lag time.



It is difficult to extrapolate the death rates that might occur in less selected populations than a cruise ship.  On the one hand, cruise ship populations skew older, which could lead to higher fatality rates than in a more general population.

But on the other hand, people generally don’t take cruises if they are in extremely bad health or are extremely old.  Plus, a cruise ship population will be drawn from economically well off populations, who have benefitted from good health care all their lives.  So, these factors might tend to indicate a lower death rate.

At any rate, the ship quarantine has now been called off, and people are being air-lifted back to their home countries or to mainland Japan, though they may well continue being kept in quarantine in those locations.  So, this interesting natural experiment is now over.  Here’s hoping that epidemiologists and other public health workers learn some useful lessons from it, statistical and otherwise.




And, here’s a more pleasant travel story than anticipating the worldwide journey of a virus.

On the Road with Bronco Billy

What follows is an account of a ten day journey through western North America during a working trip, delivering lumber from Edmonton Alberta to Dallas Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture was intriguing. In early May 1997 they hit the road.

Some time has passed since this journal was written and many things have changed since the late 1990s. That renders the journey as not just a geographical one, but also a historical account, which I think only increases its interest.

We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.

The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.

Monday, 10 February 2020

Estimating a More Realistic Fatality Rate of the Coronavirus, from Publicly Available Data



Since the outbreak of the Novel Coronavirus (2019-nCoV), there have been numerous statistics concerning the nature and effects of the virus.  These numbers have been widely disseminated by governments and published in the press.  One of the key statistics, obviously of great interest to everybody, is the fatality rate.  More on that below, after some initial remarks about the progress of the disease.

Here are a couple of graphs showing the progress of the disease, in numbers of cases and fatalities.  The first shows the numbers with conventional axes, cases on the right axis and fatalities on the left axis.  As you can see, the lines follow the same general functional shape, to a considerable level of agreement.

The functions also appear to be increasing at more than a linear rate, likely some form of an exponential function, as might be expected during the early phase of an epidemic.  Of course no function in the real world can remain exponential for too long: in the case of a virus it will eventually run out of hosts as it grows, with the hosts developing immunity or succumbing to the disease.  Eventually, the function turns downward, looking more like a quadratic, but we seem to be far from that, at the present moment.



The second shows the same data, but with a logarithmically scaled axis.  Best fit lines are also shown, as well as the accompanying function form and R-square.  The latter is a measure of how well the equation fits the data, with a value of 1 indicating a perfect fit.  The fatality function appears to be a better fit to an exponential than the total cases function, both visually (on the logarithmic plot, it approximates a straight line) and in terms of its R-square, which is high at 0.96.  The total cases curve is visually a less impressive fit, and its R-square is lower at 0.90.

Note that the first few periods depart from the exponential form, which is likely due to measurement uncertainties in the early days of reporting.  With relatively low counts, these uncertainties create a lot of noise in the data, though that tends to settle down as the Ns go up.  Though I didn’t put in piecewise smooth lines, you can see by eye that the functions “straighten out”, to a considerable degree after period 4 to 6.
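As a rough check on the exponential form, one can fit a straight line to the logarithm of the counts, since an exponential is a straight line on a logarithmic axis.  The sketch below does that for the cumulative deaths listed in the table later in this post.  The series behind the graph may start on a different day, so the R-square it prints need not equal the 0.96 quoted above; treat it as an illustration of the method rather than a reproduction of the graph.

```python
import math

# Cumulative deaths from the table in this post (23-Jan-20 to 8-Feb-20).
deaths = [25, 41, 56, 80, 106, 132, 170, 213, 259,
          304, 362, 426, 492, 565, 638, 724, 813]
days = list(range(len(deaths)))  # day 0 is 23-Jan-20


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit ln(deaths) = ln(a) + b * day, i.e. deaths = a * exp(b * day).
log_deaths = [math.log(d) for d in deaths]
b, log_a = linear_fit(days, log_deaths)

# R-square measured on the log scale, where the exponential is a line.
fitted = [log_a + b * d for d in days]
mean_log = sum(log_deaths) / len(log_deaths)
ss_res = sum((y - f) ** 2 for y, f in zip(log_deaths, fitted))
ss_tot = sum((y - mean_log) ** 2 for y in log_deaths)
r2 = 1.0 - ss_res / ss_tot

# Under a pure exponential, the count doubles every ln(2) / b days.
doubling_time = math.log(2) / b

print(f"growth rate per day: {b:.3f}")
print(f"R-square (log scale): {r2:.3f}")
print(f"implied doubling time: {doubling_time:.1f} days")
```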

 

The fact that the total cases line has a lower R-square than the total deaths line seems to be explainable by the different degrees of measurement error possible between total cases and total deaths.  The former has more room for error: for example, some cases may be asymptomatic or very mild, and therefore might never be reported and thus not included in statistical totals.  Deaths, on the other hand, are hard to miss, and though the cause of death may still sometimes be misreported, deaths still demand an explanation, so are more likely to be accurately identified and reported.

The same is true of the tendency of authorities to play down numbers to avoid a panic – it is easier to hide mild cases than deaths, so the number of cases is less likely to be accurately captured than the number of deaths (not that I am accusing anybody of doing that intentionally).

That brings us to the matter of how the fatality rate is calculated and reported.  It has generally been stated to be around 2 to 3 percent, with the value settling into a recent trend of about 2.1 per cent, in the statistics that I have read.

The fatality rate seems to be simply calculated as:

 (deaths up to time T)/(cases up to time T)

The actual numbers are given in the table below and in the graph.  As you can see the fatality rate is much higher at the start of the time series, then settles down to a fairly steady 2.0 to 2.5 percent rate.

Date         Deaths    Cases    Fatality Rate
23-Jan-20        25      265     9.4%
24-Jan-20        41      733     5.6%
25-Jan-20        56     1436     3.9%
26-Jan-20        80     2222     3.6%
27-Jan-20       106     4000     2.7%
28-Jan-20       132     5482     2.4%
29-Jan-20       170     7237     2.3%
30-Jan-20       213     9242     2.3%
31-Jan-20       259    11369     2.3%
1-Feb-20        304    13972     2.2%
2-Feb-20        362    16808     2.2%
3-Feb-20        426    20047     2.1%
4-Feb-20        492    23974     2.1%
5-Feb-20        565    27697     2.0%
6-Feb-20        638    30860     2.1%
7-Feb-20        724    34297     2.1%
8-Feb-20        813    36973     2.2%



However, this manner of calculating the rate is misleading when an epidemic is growing or shrinking.  When the epidemic is growing, it will tend to under-report the real fatality rate, and once the corner is turned on the epidemic, it will tend to over-report the rate.

The reason is that deaths are a lagged variable in the time series, compared to cases.  There is a time lag from being exposed to the virus and being infected, to coming down with the illness, to ultimately dying from it.  So, the death rate shouldn’t be calculated as current deaths divided by current cases, but as current deaths divided by cases at some earlier time.
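
To see how large the under-reporting can be during the growth phase, here is a small worked derivation under idealized assumptions of my own: cumulative cases C grow exponentially at rate r per day, every fatal case dies exactly ℓ days after being counted, and f is the true fatality rate.

```latex
\begin{align*}
C(t) &= C_0 \, e^{rt}
  && \text{(cumulative cases, growth rate } r > 0\text{)} \\
D(T) &= f \, C(T - \ell)
  && \text{(deaths trail cases by } \ell \text{ days)} \\
\frac{D(T)}{C(T)} &= f \, \frac{C(T - \ell)}{C(T)} = f \, e^{-r\ell} \;<\; f
  && \text{(the simple ratio understates } f\text{)}
\end{align*}
```

As a purely hypothetical example, with r = 0.1 per day and ℓ = 7 days, the factor e^{-0.7} is about 0.50, so the simple ratio would show only about half of the true rate.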

How long the time lag should be for the equation will be a function of the average time between infection and death.  That can only be calculated by following a cohort of patients in the early stages of the disease and noting the time to death.  One study gives this time as about 7 days, though it would take a lot more data to be really sure of this number:

“A study on 138 hospitalized patients published on February 7 on JAMA, found that the median time from first symptom to dyspnea was 5.0 days, to hospital admission was 7.0 days, and to ARDS was 8.0 days.”


Using this data gives the results of the graphs below.  The first graph divides the number of new deaths reported at time T by the number of new cases at time T-Lag, for various time lags in days.  The case for a lag of 7 days is highlighted; it gives a fatality rate of about 6.3%.



The second graph calculates the rate as the total number of deaths at time T divided by the total number of cases at time T-Lag, in days.  It gives a slightly lower fatality rate at the 7 day lag, of about 5.6%.  I should note that I clipped out the first few data points from this calculation, as the early counts had low Ns for both cases and fatalities, which yield unrealistic numbers.  By waiting a bit, the numbers settle down, as the Ns increase.  It is interesting that this method of calculating the fatality rate gives a more stable number than the first.  Nonetheless, it is notable that both of these simple calculations yield about the same fatality rate for a 7 day lag, about 6 percent.
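
The cumulative version of the lagged calculation can be sketched as follows, using the deaths and cases from the table earlier in this post.  The function name and the choice to evaluate only the final day are my own assumptions; the graph may average over several days or use a slightly different series, so the printed figures need not match it exactly.

```python
# Cumulative counts from the table in this post (23-Jan-20 to 8-Feb-20).
deaths = [25, 41, 56, 80, 106, 132, 170, 213, 259,
          304, 362, 426, 492, 565, 638, 724, 813]
cases = [265, 733, 1436, 2222, 4000, 5482, 7237, 9242, 11369,
         13972, 16808, 20047, 23974, 27697, 30860, 34297, 36973]


def lagged_rate(deaths, cases, lag, t):
    """Total deaths at day t divided by total cases at day t - lag."""
    if t - lag < 0:
        raise ValueError("lag reaches back before the start of the series")
    return deaths[t] / cases[t - lag]


last_day = len(deaths) - 1  # index of 8-Feb-20

# Lag 0 is the simple rate; longer lags compare deaths with earlier cases.
for lag in range(0, 11):
    rate = 100.0 * lagged_rate(deaths, cases, lag, last_day)
    print(f"lag {lag:2d} days: {rate:.1f}%")
```

The longer the assumed lag, the smaller the earlier case count in the denominator, and so the higher the estimated rate; that is why pinning down the true lag matters so much.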



Obviously, there is still a lot of uncertainty about the data and the epidemic is still in its early stages, at least as far as we know.  But these estimates of fatality rates seem likely to be more accurate than what we are usually seeing, as things develop.  They also align better with some earlier similar epidemics, such as SARS, which had a fatality rate of about 9%.

There is a certain symmetry in the situation – just as the currently used measure underestimates the danger of the virus, it will overestimate the danger at the time when the danger is actually going down.  At that time, authorities will either have to restate the calculation in more favorable terms (losing public trust by presenting mixed messages about how best to measure the fatality rate) or attempt to communicate the fact that things are actually getting better when the faulty measure says that they are getting worse (which also will create a public trust problem).

Let’s hope that the situation remains manageable and that people “remain calm and carry on”.  Personally, I think China has reacted quite responsibly to the outbreak.  I am not sure if western governments could have taken the similarly strong measures that seem necessary.  Various economic and privacy rights concerns would have made that extremely difficult.  However, there may be tests to come, if the epidemic progresses unfavorably.


