3.6.7 Qualitative Predictors

Here we use the Carseats data, which is included in the ISLP package. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

In [35]: Carseats = load_data('Carseats')
         Carseats.columns
Out[35]: Index(['Sales', 'CompPrice', 'Income', 'Advertising',
                'Population', 'Price', 'ShelveLoc', 'Age', 'Education',
                'Urban', 'US'],
               dtype='object')

The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location, that is, the space within a store in which the car seat is displayed. The predictor ShelveLoc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, ModelSpec() generates dummy variables automatically. These variables are often referred to as a one-hot encoding of the categorical feature. Their columns sum to one, so to avoid collinearity with an intercept, the first column is dropped. Below we fit a multiple regression model that includes some interaction terms; in its output, the column ShelveLoc[Bad] is absent because Bad is the first level of ShelveLoc.
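The encoding described above can be sketched in plain Python. This is only an illustration of the idea, not how ModelSpec() is implemented: the helper one_hot() and the five toy values are hypothetical, and the level order (Bad, Good, Medium) is assumed so that Bad comes first.

```python
# Hypothetical helper for illustration; it is not part of the ISLP package.
def one_hot(values, levels, name, drop_first=False):
    """Return (column_names, rows) of 0/1 indicator columns for `values`."""
    kept = levels[1:] if drop_first else levels
    names = [f"{name}[{level}]" for level in kept]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return names, rows

levels = ["Bad", "Good", "Medium"]  # assumed level order; Bad is first
shelve_loc = ["Bad", "Good", "Medium", "Medium", "Bad"]  # made-up toy data

# Full one-hot encoding: each row sums to 1, which duplicates the
# intercept column of ones and makes the design matrix collinear.
full_names, full_rows = one_hot(shelve_loc, levels, "ShelveLoc")
row_sums = [sum(row) for row in full_rows]

# Dropping the first level removes the redundancy; Bad becomes the baseline.
names, rows = one_hot(shelve_loc, levels, "ShelveLoc", drop_first=True)
for value, row in zip(shelve_loc, rows):
    print(value, dict(zip(names, row)))
```

With the first level dropped, a Bad location is represented by a zero in every remaining dummy column, so it acts as the baseline against which the other levels are compared.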

In [36]: allvars = list(Carseats.columns.drop('Sales'))
         y = Carseats['Sales']
         final = allvars + [('Income', 'Advertising'),
                            ('Price', 'Age')]
         X = MS(final).fit_transform(Carseats)
         model = sm.OLS(y, X)
         summarize(model.fit())
Out[36]:                         coef  std err        t  P>|t|
         intercept             6.5756    1.009    6.519  0.000
         CompPrice             0.0929    0.004   22.567  0.000
         Income                0.0109    0.003    4.183  0.000
         Advertising           0.0702    0.023    3.107  0.002
         Population            0.0002    0.000    0.433  0.665
         Price                -0.1008    0.007  -13.549  0.000
         ShelveLoc[Good]       4.8487    0.153   31.724  0.000
         ShelveLoc[Medium]     1.9533    0.126   15.531  0.000
         Age                  -0.0579    0.016   -3.633  0.000
         Education            -0.0209    0.020   -1.063  0.288
         Urban[Yes]            0.1402    0.112    1.247  0.213
         US[Yes]              -0.1576    0.149   -1.058  0.291
         Income:Advertising    0.0008    0.000    2.698  0.007
         Price:Age             0.0001    0.000    0.801  0.424

In the first line above, we made allvars a list, so that we could add the interaction terms two lines down. Our model-matrix builder has created a ShelveLoc[Good] dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLoc[Medium] dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The positive coefficient for ShelveLoc[Good] in the regression output (4.8487) indicates that a good shelving location is associated with higher sales than a bad location, holding the other predictors fixed. ShelveLoc[Medium] has a smaller positive coefficient (1.9533), indicating that a medium shelving location is associated with higher sales than a bad shelving location, but lower sales than a good shelving location.
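The comparison between shelving locations can be checked with a little arithmetic. The sketch below copies the two ShelveLoc coefficients from the regression output and computes each location's contribution to predicted Sales relative to the Bad baseline; the function name shelf_effect() is hypothetical and introduced only for illustration.

```python
# Coefficients copied from the regression output above.
coef = {"ShelveLoc[Good]": 4.8487, "ShelveLoc[Medium]": 1.9533}

def shelf_effect(location):
    """Contribution of the shelving location to predicted Sales, relative to Bad."""
    good = 1 if location == "Good" else 0      # ShelveLoc[Good] dummy
    medium = 1 if location == "Medium" else 0  # ShelveLoc[Medium] dummy
    return coef["ShelveLoc[Good]"] * good + coef["ShelveLoc[Medium]"] * medium

effects = {loc: shelf_effect(loc) for loc in ["Bad", "Medium", "Good"]}
good_vs_medium = effects["Good"] - effects["Medium"]

print(effects)
print(round(good_vs_medium, 4))
```

Because ShelveLoc does not appear in either interaction term of this model, these differences are the same whatever values the other predictors take.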
