How do you choose the best linear regression model?
(Note: I also posted this on Quora some time back.)
There are a number of methods and considerations for choosing the “best” regression model, regardless of the software and/or R package you are using. The guidelines below are relevant to R, Python, SPSS or any other statistical package.
Some packages are easier than others to use (e.g. economists prefer Stata, social scientists tend to prefer SPSS, lots of people like open source products like R or Python) but they all do much the same thing, in terms of the math. Similarly, R has regression routines in the base package, but there are also lots of specialized packages that have been optimized to make life easier (assuming that the package will load).
Picking the “best” model is really up to the analyst, regardless of the package that is used.
· What are you predicting? If it is a continuous variable (within some range), then you want to use some form of multiple regression. If it is the probability of some event occurring (e.g. will a student graduate), then you want to use some form of logistic regression. There are also more specialized routines (e.g. hierarchical regression), used for particular purposes.
· Consider the theory behind your regression model when choosing independent variables. This might be derived from previously published research, subject matter experts, or just common sense.
· You may want to use indicator variables (“dummy variables”) as well as numeric variables. These are for non-numeric (categorical) variables such as gender.
· Different variables can be tested, based on theoretical considerations or exploratory data analysis. Examining scatter plots of the possible independent variables against the dependent variable can help. Also check whether your variables fit the assumptions of regression analysis.
· The above analysis might lead you to perform data transformations that are needed to linearize some relationships (e.g. you might have to test a quadratic term) or to ensure that you aren’t violating any assumptions (e.g. a log transform if the data is skewed). Some R packages might do these transformations if you request them; other times you might have to do them in a data step.
· Once you have settled on a reasonable set of variables, test your model. You will usually be interested in the (adjusted) R-square of the model, and how that changes depending on which variables are in the model. In general, the higher the R-square, the better the model fits the data. Note that the plain R-square never goes down when you add a variable, which is why the adjusted version is the one to watch when comparing models. You can also use an F-test to see if adding a variable is justified.
· You will, of course, be interested in which of your variables are statistically significant, but be aware that statistical significance and practical significance are not the same thing.
· There are a number of standard model building methods, such as forward selection, backward selection or stepwise selection. They all have their advantages and disadvantages (and may go in and out of fashion over time).
· You also have to look out for multi-collinearity, where two or more independent variables are strongly correlated with one another. There are diagnostics for this (e.g. the variance inflation factor, VIF) as well as indications that point to multi-collinearity (e.g. a coefficient has a counter-intuitive sign).
· Consider interaction effects, where the effect of one variable on the dependent variable depends on the value of another variable. Keep in mind that models with a lot of interactions (especially 3-way or more) are difficult to interpret and explain to others.
· Examine outliers to see if there are some that are having an inordinate effect on the regression. There are diagnostics such as DFFITS and DFBETAS for that.
· Also, remember that a statistical model can give some indication of cause and effect, but on its own it can never be definitive.
· Be careful about extrapolating outside the range of your data.
· It can be interesting to compare traditional statistical modelling to newer data science techniques. You would expect them to come to similar conclusions, though the “black box” nature of many machine learning techniques can make the comparisons difficult to do.
· Try not to take criticism of your model personally. There are so many possibilities for a complex model that there is bound to be disagreement.
· And always remember the phrase “All models are wrong, but some are useful” (usually attributed to statistician George Box).
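To make a few of the points above concrete (adjusted R-square, the F-test for adding a variable, and VIF), here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the data are made up, the helper names (`solve`, `fit_ols`, `partial_f`, `vif`) are my own and not from any package, and the fit uses the textbook normal equations rather than the more careful numerical methods that real statistical software uses.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(predictors, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])  # number of coefficients, intercept included
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return {"beta": beta, "rss": rss, "r2": r2, "adj_r2": adj_r2, "df_resid": n - p}

def partial_f(small, big):
    """F statistic for the extra variables in `big` relative to nested `small`."""
    extra = small["df_resid"] - big["df_resid"]
    return ((small["rss"] - big["rss"]) / extra) / (big["rss"] / big["df_resid"])

def vif(predictors, j):
    """Variance inflation factor: regress predictor j on all the others."""
    others = [col for k, col in enumerate(predictors) if k != j]
    return 1.0 / (1.0 - fit_ols(others, predictors[j])["r2"])

# Made-up data: y depends on x1 and x2; x3 is almost a copy of x1.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [a + 0.1 * random.gauss(0, 1) for a in x1]
y = [1.0 + 2.0 * a + 1.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

small = fit_ols([x1], y)       # model with x1 only
big = fit_ols([x1, x2], y)     # model with x1 and x2
f_stat = partial_f(small, big)
vifs = [vif([x1, x2, x3], j) for j in range(3)]

print("adjusted R-square:", round(small["adj_r2"], 3), "->", round(big["adj_r2"], 3))
print("partial F for adding x2:", round(f_stat, 1))
print("VIFs for x1, x2, x3:", [round(v, 1) for v in vifs])
```

With data built this way, you should see the adjusted R-square go up when x2 is added, a large F statistic for that addition (you would compare it against an F distribution to get a p-value), and large VIFs for x1 and x3 because they carry nearly the same information. In real work you would let your package of choice do all of this; the point here is only to show what the numbers mean.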
And of course, you should do some outside reading, to keep your mind fresh. Like, maybe, the book below (which does feature a statistician, namely me) 😀 :
On the Road with Bronco Billy
What follows is an account of a ten day journey through western North America during a working trip, delivering lumber from Edmonton, Alberta to Dallas, Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture was intriguing. In early May 1997 they hit the road.
Some time has passed since this journal was written and many things have changed since the late 1990s. That renders the journey as not just a geographical one, but also a historical account, which I think only increases its interest.
We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.
The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.
Amazon U.S.: http://www.amazon.com/gp/product/B00X2IRHSK
Amazon U.K.: http://www.amazon.co.uk/gp/product/B00X2IRHSK
Amazon Germany: http://www.amazon.de/gp/product/B00X2IRHSK
Amazon France: https://www.amazon.fr/dp/B00X2IRHSK
Amazon Spain: https://www.amazon.es/dp/B00X2IRHSK
Amazon Italy: https://www.amazon.it/dp/B00X2IRHSK
Amazon Netherlands: https://www.amazon.nl/dp/B00X2IRHSK
Amazon Japan: https://www.amazon.co.jp/dp/B00X2IRHSK
Amazon Brazil: https://www.amazon.com.br/dp/B00X2IRHSK
Amazon Canada: http://www.amazon.ca/gp/product/B00X2IRHSK
Amazon Mexico: https://www.amazon.com.mx/dp/B00X2IRHSK
Amazon Australia: https://www.amazon.com.au/dp/B00X2IRHSK
Amazon India: https://www.amazon.in/dp/B00X2IRHSK