Tuesday 8 October 2019

Some Non-Random Thoughts About the Importance of Randomization



I was asked a question about randomization on Quora, which I think is of interest to a more general audience, as well.  So, my comments are below:

What is randomization? What are the different methods of random selection?

In statistics and data science, randomization is important for a number of reasons, both theoretical and practical.  Here is a short passage from the Collins Dictionary of Statistics (2005 edition):

“Randomization is the process ensuring that, when possible, the elements in a statistical experiment are carried out in a random order.
Randomization is one of the key principles in designing an experiment.  It is a safeguard against systematic errors.”

Here are some of my own observations as a practicing statistician involved in operational research - much more could be said on the subject (and has been by some extremely clever people over the centuries):

1 - Surveys

In surveys, randomization is the key principle behind a scientific sample, one whose results you can generalize from the sample to the larger population from which it is drawn. To pick a random sample, you need a representative sample frame, a list that lets you select the sample from the population of interest. If you can be sure that the sample frame includes every case in the population of interest (e.g. a list of every worker in a firm), and that the people selected will actually answer your survey, you can pick a random sample of those people with confidence that their answers will be very similar to what you would get from surveying the entire population of interest.
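
To make that concrete, here is a minimal sketch in Python of drawing a simple random sample from a complete sample frame; the worker list, sample size, and seed are all invented for illustration:

    import random

    # A toy sample frame: a complete (hypothetical) list of every worker in a firm.
    sample_frame = [f"worker_{i:04d}" for i in range(1, 2001)]

    # A fixed seed makes the draw reproducible for documentation purposes.
    rng = random.Random(20191008)

    # Draw a simple random sample of 200 workers, without replacement.
    sample = rng.sample(sample_frame, k=200)

    print(len(sample), sample[:5])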

These days, however, people are fatigued with surveys (there are too many of them), so selection bias is a real problem.  Simply put, most people hang up the phone on the surveyor, so it isn’t possible to be sure that the few that answer are typical of the population.  Because of that, the results of the survey can’t really be trusted.  Election polling has had problems during the recent past, often predicting incorrect outcomes.  And as polls lose credibility, people become even less willing to participate, which exacerbates the problem.

Nobody has yet come up with a solution to this issue.  Perhaps in the future, the government will have to empanel “statistical juries” to ensure that a random sample is used for important public questions.

2 - Experimental Design

In experimental science in the physical sciences, the object of an experiment is to hold all circumstances of the experiment constant, varying only the variable of interest, and measure how that affects the outcome of interest (e.g. vary the length of a pendulum and see how that affects the period).  But in many areas of interest (e.g. medicine, agriculture) that ideal is not possible to achieve – you just can’t hold everything constant in complicated environments such as the human body or the natural world.  Thus, the principle of randomization has been developed.

In the carefully constructed experimental designs used in these disciplines, randomization ensures that you don’t have to control for all of the many possible confounders that might influence your dependent variable but are not the main interest of the study. Basically, the argument is that randomization ensures that the effects of these potential confounders will “wash out”, so to speak.
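
As a bare-bones illustration (the subject labels and group sizes below are made up), random assignment to a treatment arm and a control arm can be as simple as shuffling the enrolment list:

    import random

    # Hypothetical enrolment list; the labels and group sizes are made up.
    subjects = [f"subject_{i:03d}" for i in range(1, 41)]

    rng = random.Random(42)
    rng.shuffle(subjects)  # the shuffle is the randomization device

    # Split the shuffled list into equal-sized treatment and control arms.
    half = len(subjects) // 2
    treatment, control = subjects[:half], subjects[half:]

    print("treatment:", treatment[:3], "...")
    print("control:  ", control[:3], "...")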

There are many explicit experimental designs that have been developed for this purpose (split plot, Latin squares, etc.). These have generally been developed in such a way that the maximum amount of information can be extracted from the minimum number of cases in the study.  Depending on the experimental design, one or “a few” (not many) variables are explicitly varied, and randomization takes care of the multitude of other potential confounders. 
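
As one small, hedged sketch: a randomized Latin square layout can be generated in a few lines of code, here using the simple “cyclic square plus random permutations” construction (a common shortcut, not a uniform draw over all possible Latin squares):

    import random

    def random_latin_square(treatments, seed=None):
        """Build a cyclic Latin square, then randomly permute rows, columns,
        and treatment labels.  Each treatment appears exactly once in every
        row and every column."""
        rng = random.Random(seed)
        n = len(treatments)
        base = [[(i + j) % n for j in range(n)] for i in range(n)]  # cyclic square

        rows = list(range(n)); rng.shuffle(rows)
        cols = list(range(n)); rng.shuffle(cols)
        labels = list(treatments); rng.shuffle(labels)

        return [[labels[base[r][c]] for c in cols] for r in rows]

    for row in random_latin_square(["A", "B", "C", "D"], seed=7):
        print(row)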

Experiments can be expensive (researcher time, equipment costs, administration, etc.), and/or relevant cases can be difficult to find and enrol (e.g. a study for treatment of a rare disease).  Via judicious choice of experimental design, these costs can be restrained, since randomization does a lot of the heavy lifting.

Still, much depends on the number of cases that you can enroll in the study. The more the better, from a statistical point of view, though that drives up costs.  A long-standing practical phrase in statistical science is “get more data” (first attributed to Ronald Fisher, a major figure in the development of statistical science).

3 - Observational studies

With observational studies, you can’t apply these nice experimental designs, so you try to include as many cases as you reasonably can, along with as many relevant variables as possible beyond your main variable of interest.  Since you can’t let randomization do the work of accounting for confounders, you have to measure as many potentially confounding variables as you can and account for them with statistical modelling methods, such as regression analysis.

For example, the amount of cigarette smoking that people do might be your main variable of interest, but other things also influence health (e.g. diet, age, exercise, gender, SES, race, etc.), so you try to include as many of these in your model as possible. You might check your study population for some fundamental characteristics (e.g. age, gender) against your wider population of interest, as a cross-check, but you can’t check against every possible confounder, so there is always some room for doubt about how generalizable your results really are.
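
Here is a toy version of that kind of adjustment, with entirely fabricated data and effect sizes, using the statsmodels formula interface; the point is only the shape of the analysis, not the numbers:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Entirely synthetic data: a health score influenced by smoking plus
    # a couple of confounders, with made-up effect sizes.
    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "smoking":  rng.poisson(8, n),        # cigarettes per day (invented)
        "age":      rng.integers(25, 75, n),
        "exercise": rng.exponential(3, n),    # hours per week (invented)
    })
    df["health"] = (80 - 0.6 * df["smoking"] - 0.3 * df["age"]
                    + 1.5 * df["exercise"] + rng.normal(0, 5, n))

    # Including the confounders in the model adjusts the smoking estimate.
    model = smf.ols("health ~ smoking + age + exercise", data=df).fit()
    print(model.params)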

4 - Monte Carlo Based Algorithms and Simulations

A lot of the newer “data science” algorithms rely on drawing random samples from a dataset, for various reasons (Monte Carlo methods, bootstrapping, etc.).  For example, the Random Forests technique is an elaboration of the Decision Trees method that uses multiple random subsamples of the same dataset, drawn with replacement, to improve predictive accuracy, via bootstrap aggregation (“bagging”).
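
As a small illustration of the resampling idea underneath bootstrapping (with simulated rather than real data), the standard error of a mean can be estimated by repeatedly resampling the dataset with replacement:

    import numpy as np

    # Simulated "observed" data; the point is the resampling, not the numbers.
    rng = np.random.default_rng(2019)
    data = rng.normal(loc=50, scale=10, size=100)

    # Bootstrap: resample the dataset with replacement many times and
    # recompute the statistic of interest on each resample.
    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(5000)
    ])

    print("estimate:", data.mean())
    print("bootstrap standard error:", boot_means.std(ddof=1))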

There are also many simulation methods, such as agent-based models that make use of randomization.  These have nothing to do with spy agencies (though spy agencies could use them too) – rather, they are computer models that have different “agents” interact within the model, to simulate reality, based on computer rules that are meant to model the real world.

A military application might have an anti-missile system and an offensive missile system “play off” against each other, within the simulation world, to see how effective different strategies or tactics would be.  The simulation model would play out multiple scenarios, randomly changing certain parameters and/or moves for the various players.   Typically, it would then report back the optimal strategies for success (depending on whether you are interested in defence or attack).
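
A drastically simplified sketch of that kind of Monte Carlo evaluation might look like the following; the “strategies” and intercept probabilities are pure invention, there only to show how random scenarios get averaged into a comparison:

    import random

    def run_scenario(intercept_prob, rng):
        """One simulated engagement: 10 incoming missiles, each intercepted
        independently with the given probability (a gross simplification)."""
        return sum(rng.random() < intercept_prob for _ in range(10))

    def evaluate(strategies, n_runs=10_000, seed=1):
        """Play each candidate strategy across many randomized scenarios and
        report the average number of interceptions."""
        rng = random.Random(seed)
        return {name: sum(run_scenario(p, rng) for _ in range(n_runs)) / n_runs
                for name, p in strategies.items()}

    # The strategy names and probabilities below are pure invention.
    print(evaluate({"spread_batteries": 0.55, "concentrate_batteries": 0.60}))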

Sources of Randomization

There are many ways to do randomization, but these days a computer-based random number generator is the most common choice for most purposes (these are actually pseudo-random, though). I have a thick textbook devoted entirely to random number generators, so this is not a simple concept to operationalize.
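
The “pseudo” part is easy to demonstrate: a seeded generator is completely deterministic, which is actually a feature when you want a reproducible analysis:

    import random

    # Pseudo-random: the same seed always reproduces the same sequence.
    rng_a = random.Random(12345)
    rng_b = random.Random(12345)

    print([rng_a.randint(1, 6) for _ in range(5)])
    print([rng_b.randint(1, 6) for _ in range(5)])  # identical to the line above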

It is also remarkably difficult to verify that a run of numbers is “truly random”.  Human beings aren’t that good at generating a random series off the top of their heads; they tend to produce fewer long runs of the same number than a real random process generates.  Discovering that a system that should be truly random isn’t really random is a good way of detecting fraud.
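
A crude way to see this is to count runs. In the sketch below, a “human-style” sequence that switches values too eagerly produces noticeably more (and hence shorter) runs than a fair coin does; the 80% switching rate is just an assumption for illustration:

    import random

    def count_runs(seq):
        """Number of maximal runs of identical consecutive values."""
        return 1 + sum(a != b for a, b in zip(seq, seq[1:])) if seq else 0

    rng = random.Random(0)
    fair = [rng.randint(0, 1) for _ in range(200)]   # about 100 runs expected

    # A crude "human-style" sequence that switches values 80% of the time,
    # because repeating yourself doesn't "feel" random.
    human = [0]
    for _ in range(199):
        human.append(1 - human[-1] if rng.random() < 0.8 else human[-1])

    print("runs in the fair sequence:   ", count_runs(fair))
    print("runs in the 'human' sequence:", count_runs(human))  # many more, shorter runs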

In the past, large books of random numbers were published, along with schemes for selecting numbers from them. A Million Random Digits with 100,000 Normal Deviates was published by the RAND Corporation, for example.

For scientific studies that require absolutely perfect random numbers, so to speak, randomization can be based on some natural process that is known to be completely random (e.g. radioactive decay, since that is a quantum mechanics based process, and thus about as random as you can get).
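
For everyday work, the practical stand-in is the operating system’s entropy pool, which mixes in hardware noise sources; in Python, for example, the secrets module draws on it rather than on a seeded algorithm:

    import secrets

    # secrets draws on the operating system's entropy pool rather than a
    # seeded algorithm, so the output is not reproducible on purpose.
    print(secrets.randbelow(100))   # an integer in [0, 100)
    print(secrets.token_hex(16))    # 32 hex characters of entropy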

Randomness in Regular Life

It is a good idea to keep randomness in mind, in regular life.  For example, some hedge fund operators have excellent reputations, but it can turn out that these are just long runs of random “luck” breaking their way.  With enough hedge funds out there, somebody has to be extremely successful, just by random chance.
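
A quick back-of-the-envelope simulation makes the point: give 10,000 hypothetical funds a coin-flip chance of beating the market each year, and a handful will still post a ten-year winning streak with no skill at all:

    import random

    # Hypothetical: 10,000 funds, each with a 50/50 chance of beating the
    # market in any given year, independent of any skill.
    rng = random.Random(8)
    n_funds, n_years = 10_000, 10

    streaks = sum(all(rng.random() < 0.5 for _ in range(n_years))
                  for _ in range(n_funds))

    # Expected value is about 10 (10,000 / 2**10), skill not required.
    print(f"{streaks} of {n_funds} coin-flip funds beat the market "
          f"{n_years} years running")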

Similarly for cultural products, like books and movies.  A successful person can start off with a good run of luck, then remain on top because “nothing succeeds like success”. It is easy to get fooled in such matters, into thinking there is some unique talent at work, rather than dumb luck.

The same is true for random runs of bad results, generally known as “bad luck”.  Getting a few bad cards at the beginning of a poker game and going bust doesn’t necessarily mean that you are a bad player – that’s just life.  But if it happens most of the time, that’s another story.

And, because cartoons are fun, here are a couple from XKCD (not drawn at random):

Now that you have read about randomness, why not read about a road trip, which is much more fun than a random walk?

A Drive Across Newfoundland


Newfoundland, Canada’s most easterly province, is a region that is both fascinating in its unique culture and amazing in its vistas of stark beauty. The weather is often wild, with coastal regions known for steep cliffs and crashing waves (though tranquil beaches exist too). The inland areas are primarily Precambrian shield, dominated by forests, rivers, rock formations, and abundant wildlife. The province also features some of the Earth’s most remarkable geology, notably The Tablelands, where the mantle rocks of the Earth’s interior have been exposed at the surface, permitting one to explore an almost alien landscape, an opportunity available in only a few scattered regions of the planet.

The city of St. John’s is one of Canada’s most distinctive urban areas, with a population that maintains many old traditions and cultural aspects of the British Isles. That’s true of the rest of the province as well, where the people are friendly and inclined to chat amiably with visitors. Plus, they talk with amusing accents and party hard, so what’s not to like?

This account focusses on a two-week road trip in October 2007, from St. John’s in the southeast, to L’Anse aux Meadows in the far northwest, the only known Viking settlement in North America. It also features a day hike visit to The Tablelands, a remarkable and majestic geological feature. Even those who don’t normally consider themselves very interested in geology will find themselves awe-struck by these other-worldly landscapes.
