3.6.7 Qualitative Predictors

Here we use the Carseats data, which is included in the ISLP package. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

In [35]: Carseats = load_data('Carseats')
Carseats.columns

Out[35]: Index(['Sales', 'CompPrice', 'Income', 'Advertising',
       'Population', 'Price', 'ShelveLoc', 'Age', 'Education',
       'Urban', 'US'],
      dtype='object')

The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location, that is, the space within a store in which the car seat is displayed. The predictor ShelveLoc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, ModelSpec() generates dummy variables automatically. These variables are often referred to as a one-hot encoding of the categorical feature. The dummy columns sum to a column of ones, so to avoid collinearity with an intercept, the first column is dropped. In the output below, the column ShelveLoc[Bad] has been dropped, since Bad is the first level of ShelveLoc. We now fit a multiple regression model that includes some interaction terms.
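To make the collinearity argument concrete, here is a minimal sketch using only NumPy. The toy shelve_loc array is hypothetical (it is not the Carseats data), and the encoding is built by hand rather than with ModelSpec(), so this illustrates the idea and not the internals of the ISLP package.

```python
import numpy as np

# Hypothetical toy column (an assumption, not the Carseats data).
shelve_loc = np.array(['Bad', 'Good', 'Medium', 'Medium', 'Good', 'Bad'])

# np.unique returns the sorted levels, so 'Bad' comes first.
levels = np.unique(shelve_loc)

# Full one-hot encoding: one 0/1 column per level.
one_hot = (shelve_loc[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(shelve_loc), 1))

# Each row has exactly one 1, so the dummy columns add up to the intercept column.
row_sums = one_hot.sum(axis=1)

# Intercept plus all three dummies: 4 columns, but the rank is only 3.
X_full = np.hstack([intercept, one_hot])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping the first level's column (Bad) restores full column rank.
X_reduced = np.hstack([intercept, one_hot[:, 1:]])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(levels, row_sums, rank_full, rank_reduced)
```

With all three dummies plus an intercept, the design matrix has four columns but rank three, so the least squares coefficients are not uniquely determined. Removing the Bad column leaves three linearly independent columns.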

In [36]: allvars = list(Carseats.columns.drop('Sales'))
y = Carseats['Sales']
final = allvars + [('Income', 'Advertising'),
                   ('Price', 'Age')]
X = MS(final).fit_transform(Carseats)
model = sm.OLS(y, X)
summarize(model.fit())
Out[36]:	`coef`	`std err`	`t`	`P>|t|`
`intercept`	`6.5756`	`1.009`	`6.519`	`0.000`
`CompPrice`	`0.0929`	`0.004`	`22.567`	`0.000`
`Income`	`0.0109`	`0.003`	`4.183`	`0.000`
`Advertising`	`0.0702`	`0.023`	`3.107`	`0.002`
`Population`	`0.0002`	`0.000`	`0.433`	`0.665`
`Price`	`-0.1008`	`0.007`	`-13.549`	`0.000`
`ShelveLoc[Good]`	`4.8487`	`0.153`	`31.724`	`0.000`
`ShelveLoc[Medium]`	`1.9533`	`0.126`	`15.531`	`0.000`
`Age`	`-0.0579`	`0.016`	`-3.633`	`0.000`
`Education`	`-0.0209`	`0.020`	`-1.063`	`0.288`
`Urban[Yes]`	`0.1402`	`0.112`	`1.247`	`0.213`
`US[Yes]`	`-0.1576`	`0.149`	`-1.058`	`0.291`
`Income:Advertising`	`0.0008`	`0.000`	`2.698`	`0.007`
`Price:Age`	`0.0001`	`0.000`	`0.801`	`0.424`

In the first line of the code above, we made allvars a list, so that we could add the interaction terms two lines down. Our model-matrix builder has created a ShelveLoc[Good] dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLoc[Medium] dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The positive coefficient for ShelveLoc[Good] in the regression output (4.8487) indicates that, holding the other predictors fixed, a good shelving location is associated with higher sales than a bad location. ShelveLoc[Medium] has a smaller positive coefficient (1.9533), indicating that a medium shelving location is associated with higher sales than a bad shelving location, but lower sales than a good one.
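The reading of dummy coefficients as contrasts with the omitted level can be checked on a small example. The sketch below uses hypothetical toy numbers (not the Carseats values) and NumPy's least-squares solver in place of sm.OLS. Because the toy model contains only the shelving dummies, each coefficient equals a raw difference in group means. In the full model above, the coefficients are instead differences adjusted for the other predictors.

```python
import numpy as np

# Hypothetical toy data (an assumption, not the Carseats values).
shelve_loc = np.array(['Bad', 'Bad', 'Medium', 'Medium', 'Good', 'Good'])
sales = np.array([4.0, 6.0, 7.0, 9.0, 10.0, 12.0])

# Intercept plus dummies for Good and Medium; Bad is the omitted baseline.
X = np.column_stack([
    np.ones(len(sales)),
    (shelve_loc == 'Good').astype(float),
    (shelve_loc == 'Medium').astype(float),
])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_good, b_medium = beta

# Group means for comparison with the fitted coefficients.
mean_bad = sales[shelve_loc == 'Bad'].mean()
mean_good = sales[shelve_loc == 'Good'].mean()
mean_medium = sales[shelve_loc == 'Medium'].mean()

print('intercept vs. mean(Bad):        ', intercept, mean_bad)
print('b_good vs. mean(Good) - Bad:    ', b_good, mean_good - mean_bad)
print('b_medium vs. mean(Medium) - Bad:', b_medium, mean_medium - mean_bad)
```

In this toy fit the intercept recovers the mean for the Bad group, and the two dummy coefficients recover how far the Good and Medium means sit above it, which mirrors the ordering Good > Medium > Bad seen in the regression output.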
