Friday 5 June 2020

How do you choose the best linear regression model?


(Note: this was also a Quora post that I wrote some time back.)

There are a number of methods and considerations for choosing the “best” regression method, regardless of the software and/or R package you are using. The guidelines below are relevant to R, Python, SPSS or whatever.
 
Some packages are easier than others to use (e.g. economists prefer Stata, social scientists tend to prefer SPSS, lots of people like open source products like R or Python) but they all do much the same thing, in terms of the math. Similarly, R has regression routines in the base package, but there are also lots of specialized packages that have been optimized to make life easier (assuming that the package will load).
Picking the “best” model is really up to the analyst, regardless of the package that is used.
·        What are you predicting? If it is a continuous variable (within some range), then you want to use some form of multiple linear regression. If it is the probability of some event occurring, i.e. a yes/no outcome (e.g. will a student graduate), then you want to use some form of logistic regression. There are also more specialized routines (e.g. hierarchical regression), used for particular purposes.

·        Consider the theory behind your regression model when choosing independent variables. This might be derived from previously published research, subject matter experts, or just common sense.

·        You may want to use indicator variables (“dummy variables”) as well as numeric variables. These are for categorical (non-numeric) variables such as gender. A variable with k categories needs k-1 indicators when the model includes an intercept.

·        Different variables can be tested, based on theoretical considerations or exploratory data analysis. Examining scatter plots of the possible independent variables against the dependent variable can help. Also check whether your data fit the assumptions of regression analysis (e.g. a linear relationship, constant variance and roughly normal residuals).

·        The above analysis might lead you to perform data transformations that are needed to linearize some relationships (e.g. you might have to test a quadratic term) or to ensure that you aren’t violating any assumptions (e.g. a log transform if the data is skewed). Some R packages will do these transformations if you request them; other times you might have to do them in a data step.

·        Once you have settled on a reasonable set of variables, test your model. You will usually be interested in the adjusted R-square of the model, and how that changes depending on which variables are in the model. A higher R-square means the model fits the data better, but the plain R-square never goes down when you add a variable, which is why the adjusted version (which penalizes extra variables) is the better guide. You can also use an F-test to see if adding a variable is justified.

·        You will, of course, be interested in which of your variables are statistically significant, but be aware that statistical significance and practical significance are not the same thing.

·        There are a number of standard model building methods, such as forward selection, backward selection or stepwise selection. They all have their advantages and disadvantages (and may go in and out of fashion over time).

·        You also have to look out for multi-collinearity, where two or more independent variables are themselves highly correlated. There are diagnostics for this (e.g. VIF, the variance inflation factor) as well as indications that point to multi-collinearity (e.g. a coefficient has a counter-intuitive sign).

·        Consider interaction effects, where the effect of one variable on the outcome depends on the value of another variable. Keep in mind that models with a lot of interactions (especially 3-way or more) are difficult to interpret and explain to others.

·        Examine outliers to see if there are some that are having an inordinate effect on the regression. There are diagnostics such as DFFITS (written DFITS in some packages) and DFBETAS for that.

·        Also, remember that a statistical model can suggest cause and effect, but on its own it can never be definitive. Regression on observational data shows association, not causation.

·        Be careful about extrapolating outside the range of your data.

·        It can be interesting to compare traditional statistical modelling to newer data science techniques. You would expect them to come to similar conclusions, though the “black box” nature of many machine learning techniques can make the comparisons difficult to do.

·        Try not to take criticism of your model personally. There are so many possibilities for a complex model that there is bound to be disagreement.

·        And always remember the phrase “All models are wrong, but some are useful” (usually attributed to statistician George Box).
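
To make the point about R-square and the F-test more concrete, here is a minimal Python sketch using only numpy. Everything in it is made up for illustration: the data are simulated, and the names fit_rss, r2_and_adjusted and partial_f are my own helpers, not functions from any package. In practice your stats package reports these numbers for you.

```python
import numpy as np

def fit_rss(X, y):
    """Fit ordinary least squares with an intercept; return the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def r2_and_adjusted(rss, y, p):
    """Plain and adjusted R-square for a model with p predictors (intercept not counted)."""
    n = len(y)
    tss = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - rss / tss
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

def partial_f(rss_small, rss_big, n, p_small, p_big):
    """F statistic for adding (p_big - p_small) predictors to a nested model."""
    numerator = (rss_small - rss_big) / (p_big - p_small)
    denominator = rss_big / (n - p_big - 1)
    return numerator / denominator

# Simulated data: y depends on x1 and x2; "junk" is unrelated to y.
rng = np.random.default_rng(0)
n = 200
x1, x2, junk = rng.normal(size=(3, n))
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

rss_a = fit_rss(np.column_stack([x1]), y)            # model A: x1
rss_b = fit_rss(np.column_stack([x1, x2]), y)        # model B: x1 + x2
rss_c = fit_rss(np.column_stack([x1, x2, junk]), y)  # model C: x1 + x2 + junk

r2_a, adj_a = r2_and_adjusted(rss_a, y, 1)
r2_b, adj_b = r2_and_adjusted(rss_b, y, 2)
r2_c, adj_c = r2_and_adjusted(rss_c, y, 3)

f_x2 = partial_f(rss_a, rss_b, n, 1, 2)    # is adding x2 justified?
f_junk = partial_f(rss_b, rss_c, n, 2, 3)  # is adding junk justified?

print(f"A: R2={r2_a:.3f} adj={adj_a:.3f}")
print(f"B: R2={r2_b:.3f} adj={adj_b:.3f}  F for adding x2   = {f_x2:.1f}")
print(f"C: R2={r2_c:.3f} adj={adj_c:.3f}  F for adding junk = {f_junk:.2f}")
```

The plain R-square of model C can never be lower than that of model B, even though the extra variable is pure noise. The adjusted R-square and the F statistic (which you would compare against an F distribution with 1 and n - p - 1 degrees of freedom) are what tell you whether the extra variable earned its place.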

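The VIF diagnostic mentioned above is simple enough to compute by hand: regress each independent variable on all the others and take 1 / (1 - R-square). The following is a rough numpy sketch on simulated data. The function name vif and the variables a, b and c are made up for this example, and a VIF above 10 is a common rule of thumb rather than a hard rule.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n rows, k predictors)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

Here a and b should show very large VIFs while c stays close to 1, which is the signal that a and b are carrying much the same information.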

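For the outlier diagnostics mentioned above, here is a hedged numpy sketch of DFFITS, using the standard textbook formula (externally studentized residual times the square root of h / (1 - h), where h is the leverage). The data are simulated with one deliberately planted bad point, the function name dffits is my own, and the cutoff of 2 * sqrt(p / n) is only a widely used rule of thumb.

```python
import numpy as np

def dffits(x, y):
    """Return (DFFITS values, rule-of-thumb cutoff) for an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    p = design.shape[1]                    # number of parameters, intercept included
    q, _ = np.linalg.qr(design)
    h = (q ** 2).sum(axis=1)               # leverages (diagonal of the hat matrix)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    e = y - design @ beta                  # ordinary residuals
    rss = e @ e
    # Residual variance with observation i left out.
    s2_loo = (rss - e ** 2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_loo * (1.0 - h))    # externally studentized residuals
    return t * np.sqrt(h / (1.0 - h)), 2.0 * np.sqrt(p / n)

# Simulated data with one planted point far from the rest and far off the line.
rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
x[0], y[0] = 6.0, -2.0

values, cutoff = dffits(x, y)
flagged = np.flatnonzero(np.abs(values) > cutoff)
print("cutoff:", round(cutoff, 3))
print("flagged observations:", flagged)
print("most influential observation:", int(np.argmax(np.abs(values))))
```

A flagged point is not automatically wrong. It is a point worth looking at, to decide whether it is a data error or a genuine (and interesting) observation.
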
-------------------------------------------------------------------------------------------------------------

And of course, you should do some outside reading, to keep your mind fresh.  Like, maybe, the book below (which does feature a statistician, namely me) 😀 :

On the Road with Bronco Billy

What follows is an account of a ten-day journey through western North America during a working trip, delivering lumber from Edmonton, Alberta to Dallas, Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture, was intriguing. In early May 1997 they hit the road.
Some time has passed since this journal was written and many things have changed since the late 1990s. That renders the journey as not just a geographical one, but also a historical account, which I think only increases its interest.

We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.

The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.
