Wednesday, 2 February 2022

Running a dummy variable regression in Excel

 

How do you run dummy variable regression in Excel?


I don’t believe that Excel has an automatic way to do this, unlike some stats packages. So, you have to do a bit of data preparation yourself.

  • If the variable that interests you has N categories (say, 3 regions for example), then you need N-1 categories (for the example, that would be 2).

  • Choose a reference category, which will be the category that doesn’t actually get a dummy-coded variable. Which category you choose might depend on your research/business question. If there is no compelling reason to do otherwise, pick the category with the most observations.

  • In some part of your worksheet, copy over your original data, but insert new columns for the dummy variables. So, in the example case, set up two new variables. You can drop the original variable from this copy dataset, once you have assigned the dummy variables, as you won’t use it in the regression.

  • Code those new variables (in this case 2 new variables) as 1 or 0, depending on whether the case satisfies the condition for the new variable. For example if your old variable was region (East, Central, West), code the dummy variable for East as 1 if the observation is from the eastern region.

  • A useful convention is to start your dummy variable names with the letter d, and something descriptive about the variable. So, assuming “East” had the most observations in the original categorical variable, you would have two new variables D_Central and D_West. There would be no D_East, as that is your reference case variable.

  • A common way to recode the dummy variables is with an “if” function. So, in D_West code if(Region=”West”, 1,0). “Region” would refer to the cell which had the value for Region in your original dataset.

  • Once you have set that up, either use the regression functions in Excel or use the regression options in the Data Analysis package in Excel. If you don’t see that, load it from the “add-ins” menu.

  • To interpret the output, the coefficient for the dummy variables gives the strength of that variable compared to the reference variable, on the value of your output variable.

  • If you computed a point estimate with you regression, when both dummy variables are 0, that gives the point estimate for a case from the East region. If you assign the value 1 to one of the dummy variables, then that gives the value for a case from that region. You can only assign the value 1 to one of the dummy variables, the other has to be 0 (you can’t be in two regions at the same time).

You can use these dummy variables to include interaction effects, by including some more data transformation steps. If your dataset is fairly large, or as your model starts to get pretty complicated, you might want to use a stats program like SPSS or R. The later is a little tricky to learn, but it is open source and thus free. These more purely statistically-oriented software products will sometimes have options or packages that automate a lot of the data transformations (though doing it yourself really helps you to understand the underlying ideas).

The usual rules about evaluating models apply. The XKCD comic has a amusing example of model evaluation.



No comments:

Post a Comment