Some Non-Random Thoughts About the Importance of Randomization
I was asked a question about randomization on Quora, which I think is of
interest to a more general audience as well, so my comments are below:
What is randomization? What are the different methods of random selection?
In statistics and data science, randomization is important for a number of
reasons, both theoretical and practical. Here is a short passage from the
Collins Dictionary of Statistics (2005 edition):
“Randomization is the process ensuring that, when possible, the elements in a
statistical experiment are carried out in a random order. Randomization is one
of the key principles in designing an experiment. It is a safeguard against
systematic errors.”
Here are some of my own observations as a practicing
statistician involved in operational research - much more could be said on the
subject (and has been by some extremely clever people over the centuries):
1 - Surveys
In surveys, randomization is the key principle behind a scientific sample, one
where you can generalize results from the sample to the larger population from
which it is drawn. To pick a random sample, you need a representative sampling
frame: a list from which you can select members of the population of interest.
If you can be sure that the sampling frame includes all cases in the population
of interest (e.g. a list of every worker in a firm), and that the people
selected will actually answer your survey, you can pick a random sample of those
people with confidence that their answers will be very similar to what you
would get from surveying the entire population of interest.
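As a minimal sketch of drawing a simple random sample from a sampling frame, here is how it might look in Python. The firm of 500 workers, the worker IDs, the sample size of 50 and the seed are all made up for illustration.

```python
import random

# Hypothetical sampling frame: one ID for every worker in a firm of 500 people.
sampling_frame = [f"worker_{i:03d}" for i in range(1, 501)]

# A seeded generator makes the draw reproducible, which helps if the
# selection ever has to be audited. The seed value itself is arbitrary.
rng = random.Random(2024)

# Simple random sample: every worker has the same chance of selection,
# and nobody can be picked twice (sampling without replacement).
sample = rng.sample(sampling_frame, k=50)

print(len(sample), "workers selected; first five:", sample[:5])
```

Of course, the code only handles the easy part. It assumes the frame is complete and that everyone selected actually responds.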
These days, however, people are fatigued with surveys (there are too many of
them), so non-response bias, a form of selection bias, is a real problem. Simply
put, most people hang up the phone on the surveyor, so it isn’t possible to be
sure that the few who answer are typical of the population. Because of that,
the results of the survey can’t really be trusted. Election polling has had
problems in the recent past, often predicting incorrect outcomes. And as polls
lose credibility, people become even less willing to participate, which
exacerbates the problem.
Nobody has yet come up with a fully satisfactory solution to this issue.
Perhaps in the future, the government will have to empanel “statistical juries”
to ensure that a random sample is used for important public questions.
2 - Experimental Design
In the experimental physical sciences, the object of an experiment is to hold
all circumstances constant, varying only the variable of interest, and to
measure how that affects the outcome of interest (e.g. vary the length of a
pendulum and see how that affects the period). But in many areas of interest
(e.g. medicine, agriculture) that ideal is not possible to achieve – you just
can’t hold everything constant in complicated environments such as the human
body or the natural world. Thus, the principle of randomization was developed.
In the carefully constructed experimental designs used in these disciplines,
randomly assigning cases to treatments means that you don’t have to explicitly
control for all of the many possible confounders that might influence your
dependent variable but are not the main interest of the study. Basically, the
argument is that randomization spreads these potential confounders evenly, on
average, across the treatment groups, so that their effects “wash out”, so to
speak.
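To illustrate the “wash out” argument, here is a small Python sketch under made-up assumptions: 1,000 hypothetical subjects, with age playing the role of a confounder that the researcher never measures. Random assignment should leave the two groups with very similar average ages, even though age was never used in the assignment.

```python
import random
from statistics import mean

rng = random.Random(7)  # arbitrary seed, for reproducibility

# Hypothetical subjects, each with an unmeasured confounder (age in years).
ages = [rng.randint(20, 80) for _ in range(1000)]

# Randomly assign half of the subjects to treatment and half to control.
indices = list(range(len(ages)))
rng.shuffle(indices)
treatment, control = indices[:500], indices[500:]

mean_age_treatment = mean(ages[i] for i in treatment)
mean_age_control = mean(ages[i] for i in control)

print(f"mean age, treatment: {mean_age_treatment:.1f}")
print(f"mean age, control:   {mean_age_control:.1f}")
```

The balance is only approximate, and it gets worse as the groups get smaller, which is one reason why the number of cases matters so much.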
There are many explicit experimental designs that have been
developed for this purpose (split plot, Latin squares, etc.). These have generally
been developed in such a way that the maximum amount of information can be extracted
from the minimum number of cases in the study.
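As one example, a Latin square lays out n treatments in an n x n grid so that each treatment appears exactly once in every row and every column. The Python sketch below builds a randomized one by shuffling the rows, columns and labels of a simple cyclic square. The function name and the four treatments are my own illustration, and this shortcut does not pick uniformly from all possible Latin squares.

```python
import random

def random_latin_square(treatments, rng):
    """Return an n x n layout with each treatment once per row and column."""
    n = len(treatments)
    # Cyclic starting square: row r is the sequence 0..n-1 shifted by r places.
    square = [[(r + c) % n for c in range(n)] for r in range(n)]
    # Shuffling rows and columns keeps the once-per-row, once-per-column property.
    rng.shuffle(square)
    column_order = list(range(n))
    rng.shuffle(column_order)
    square = [[row[c] for c in column_order] for row in square]
    # Randomly decide which treatment each symbol stands for.
    labels = list(treatments)
    rng.shuffle(labels)
    return [[labels[symbol] for symbol in row] for row in square]

rng = random.Random(3)  # arbitrary seed
layout = random_latin_square(["A", "B", "C", "D"], rng)
for row in layout:
    print(" ".join(row))
```

In an agricultural trial, for instance, the rows and columns might be positions in a field, so that any fertility gradient in either direction affects every treatment equally.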
Depending on the experimental design, one or “a few” (not many)
variables are explicitly varied, and randomization takes care of the multitude
of other potential confounders.
Experiments can be expensive (researcher time, equipment costs, administration,
etc.), and/or relevant cases can be difficult to find and enrol (e.g. a study
of a treatment for a rare disease). Through judicious choice of experimental
design, these costs can be contained, since randomization does a lot of the
heavy lifting.
Still, much depends on the number of cases that you can
enrol in the study. The more the better, from a statistical point of view,
though that drives up costs. A
long-standing practical phrase in statistical science is “get more data” (first
attributed to Ronald Fisher, a major figure in the development of statistical
science).
3 - Observational Studies
With observational studies, you can’t apply these nice experimental designs,
and you can’t let randomization do the work of accounting for confounders.
Instead, you try to collect as many cases as you can and, apart from your main
variable of interest, to include as many potentially confounding variables as
you reasonably can in your data. You then adjust for them with statistical
modelling methods, such as regression analysis.
For example, the amount of cigarette smoking that people do
might be your main variable of interest, but other things also influence health
(e.g. diet, age, exercise, gender, SES, race, etc.), so you try to include as
many of these in your model as is possible. You might check your study
population for some fundamental characteristics (e.g. age, gender) against your
wider population of interest, as a cross-check, but you can’t check against
every possible confounder, so there is always some room for doubt about how
generalizable your results really are.
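Here is a rough Python sketch of why the adjustment matters, using simulated data rather than any real study. I assume, purely for illustration, that age raises smoking and independently lowers a health score, and that the true effect of smoking on the score is -1.0. All of the variable names and coefficients are invented.

```python
import random

rng = random.Random(1)  # arbitrary seed
n = 5000

# Simulated (not real) data: age is the confounder.
age = [rng.gauss(0, 1) for _ in range(n)]
smoking = [0.5 * a + rng.gauss(0, 1) for a in age]
health = [-1.0 * s - 2.0 * a + rng.gauss(0, 1) for s, a in zip(smoking, age)]

def centred(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

s, a, y = centred(smoking), centred(age), centred(health)

# Naive regression of health on smoking alone: age is ignored.
naive_slope = dot(s, y) / dot(s, s)

# Regression on smoking and age together, solved from the 2 x 2 normal equations.
det = dot(s, s) * dot(a, a) - dot(s, a) ** 2
adjusted_slope = (dot(s, y) * dot(a, a) - dot(a, y) * dot(s, a)) / det

print(f"naive smoking effect:    {naive_slope:.2f}")
print(f"adjusted smoking effect: {adjusted_slope:.2f}")
```

The naive estimate blames smoking for part of the damage done by age, while the adjusted one comes out close to the -1.0 that was built into the simulation. The catch, in a real study, is that you can only adjust for the confounders you thought to measure.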
4 - Monte Carlo Based Algorithms and Simulations
A lot of the newer “data science” algorithms rely on drawing random samples
from a dataset, for various reasons (Monte Carlo, bootstrapping, etc.). For
example, the Random Forests technique is an elaboration of the Decision Trees
method: it fits many trees, each to a random sample drawn with replacement from
the same dataset (“bootstrapping”), and combines their predictions to improve
predictive accuracy.
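The resampling step itself is simple. As a sketch, the Python below bootstraps the mean of a made-up dataset of 200 values and reads off an approximate 95% percentile interval. It shows only the bootstrap idea, not a Random Forest, and the data, seed and number of resamples are arbitrary choices of mine.

```python
import random
from statistics import mean

rng = random.Random(42)  # arbitrary seed

# Made-up dataset standing in for real observations.
data = [rng.gauss(50, 10) for _ in range(200)]

def bootstrap_means(values, n_resamples, rng):
    """Mean of each resample, drawn with replacement at the original size."""
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]

n_resamples = 2000
boot = sorted(bootstrap_means(data, n_resamples, rng))

# Approximate 95% percentile interval: cut off 2.5% in each tail.
lower = boot[n_resamples * 25 // 1000]
upper = boot[n_resamples * 975 // 1000 - 1]

print(f"sample mean: {mean(data):.2f}")
print(f"bootstrap 95% interval: ({lower:.2f}, {upper:.2f})")
```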
There are also many simulation methods, such as agent-based models, that make
use of randomization. These have nothing to do with spy agencies (though spy
agencies could use them too) – rather, they are computer models in which
different “agents” interact according to programmed rules that are meant to
mimic the real world.
A military application might have an anti-missile system and
an offensive missile system “play off” against each other, within the
simulation world, to see how effective different strategies or tactics would
be. The simulation model would play out multiple
scenarios, randomly changing certain parameters and/or moves for the various
players. Typically, it would then
report back the best-performing strategies (depending on whether you
are interested in defence or attack).
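A toy version of such a simulation can be sketched in a few lines of Python. Everything here is hypothetical: five incoming missiles, interceptors that each hit with probability 0.7, and a choice between firing one or two interceptors at each missile. The scenario is simple enough to have an exact answer, which gives a check on the simulation.

```python
import random

N_MISSILES = 5         # hypothetical size of the attack
HIT_PROBABILITY = 0.7  # hypothetical chance that one interceptor succeeds

def estimate_leak_probability(interceptors_per_missile, rng, n_trials=50_000):
    """Estimate the chance that at least one incoming missile gets through."""
    failed_trials = 0
    for _trial in range(n_trials):
        for _missile in range(N_MISSILES):
            # A missile gets through only if every interceptor fired at it misses.
            if all(rng.random() >= HIT_PROBABILITY
                   for _shot in range(interceptors_per_missile)):
                failed_trials += 1
                break  # one leak is enough to count this trial as a failure
    return failed_trials / n_trials

rng = random.Random(11)  # arbitrary seed
results = {}
for k in (1, 2):
    results[k] = estimate_leak_probability(k, rng)
    exact = 1 - (1 - (1 - HIT_PROBABILITY) ** k) ** N_MISSILES
    print(f"{k} interceptor(s) per missile: "
          f"simulated {results[k]:.3f}, exact {exact:.3f}")
```

Real simulations earn their keep in cases where no exact formula is available, but the logic is the same: play the scenario many times with random outcomes and count what happens.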
Sources of Randomization
There are many ways to do randomization, but these days a computer-based random
number generator is the most common choice for most purposes (these are
actually pseudo-random, though: the numbers come from a deterministic algorithm
started from a seed). I have a thick textbook devoted to random number
generators, which suggests that this is not a simple concept to operationalize.
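The “pseudo” part is easy to demonstrate. In this small Python sketch, two generators started from the same seed produce exactly the same digits, while a third with a different seed goes its own way. The seed values are arbitrary.

```python
import random

gen_a = random.Random(12345)
gen_b = random.Random(12345)  # same seed as gen_a
gen_c = random.Random(54321)  # different seed

digits_a = [gen_a.randint(0, 9) for _ in range(50)]
digits_b = [gen_b.randint(0, 9) for _ in range(50)]
digits_c = [gen_c.randint(0, 9) for _ in range(50)]

print("same seed, same digits:     ", digits_a == digits_b)
print("different seed, same digits:", digits_a == digits_c)
```

That determinism is a feature for reproducible research, but it is also why this kind of generator should not be used where unpredictability matters, such as in security applications.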
It is also remarkably difficult to verify that a run of numbers is “truly
random”. People aren’t very good at generating a random series off the top of
their heads: their sequences tend to contain fewer runs of the same number than
a truly random process generates.
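To get a feel for how long the runs in a genuinely random sequence are, here is a Python sketch that simulates 2,000 sequences of 100 fair coin flips and averages the longest run of identical outcomes in each. The function name and the simulation sizes are my own choices.

```python
import random
from itertools import groupby

def longest_run(sequence):
    """Length of the longest stretch of identical consecutive items."""
    return max(len(list(group)) for _, group in groupby(sequence))

rng = random.Random(5)  # arbitrary seed
n_sequences = 2000

total = 0
for _ in range(n_sequences):
    flips = [rng.choice("HT") for _ in range(100)]
    total += longest_run(flips)
average_longest = total / n_sequences

print(f"average longest run in 100 flips: {average_longest:.2f}")
```

A sequence written out by hand that never has more than three or four heads or tails in a row should therefore arouse some suspicion.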
Discovering that a system that should be truly random isn’t really
random is a good way of detecting fraud.
In the past, large books of random numbers were published, and schemes for
selecting numbers from those books were used. A Million Random Digits with
100,000 Normal Deviates, published by the RAND Corporation, is one example.
For scientific studies that require absolutely perfect
random numbers, so to speak, randomization can be based on some natural process
that is known to be completely random (e.g. radioactive decay, since that is a
quantum mechanics based process, and thus about as random as you can get).
Randomness in Regular Life
It is a good idea to keep randomness in mind, in regular
life. For example, some hedge fund
operators have excellent reputations, but it can turn out that these are just
long runs of random “luck” breaking their way.
With enough hedge funds out there, somebody has to be extremely successful,
just by random chance.
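A quick Python sketch makes the point. Suppose, purely hypothetically, that there are 10,000 funds and that each one has a 50/50 chance of beating the market in any given year, with no skill involved at all. The code counts how many of them beat the market ten years running.

```python
import random

rng = random.Random(99)  # arbitrary seed
n_funds = 10_000         # hypothetical number of funds
n_years = 10

# A fund is "lucky" if it beats the market in every one of the ten years.
lucky_funds = sum(
    all(rng.random() < 0.5 for _year in range(n_years))
    for _fund in range(n_funds)
)

expected = n_funds * 0.5 ** n_years  # 10,000 / 1,024, a little under ten funds

print(f"funds with a perfect {n_years}-year record: {lucky_funds}")
print(f"expected by chance alone: {expected:.1f}")
```

Those few funds would have spotless ten-year records and, no doubt, excellent reputations.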
Similarly for cultural products, like books and movies. A successful person can start off with a good
run of luck, then remain on top because “nothing succeeds like success”. It is
easy to get fooled in such matters, into thinking there is some unique talent
at work, rather than dumb luck.
The same is true for random runs of bad results, generally known as “bad luck”.
Getting a few bad cards at the beginning of a poker game and going bust
doesn’t necessarily mean that you are a bad player – that’s just life. But if
it happens most of the time, that’s another story.
And, because cartoons are fun, here are a couple from XKCD
(not drawn at random):