Chapter 7
Multiple Regression with Categorical Variables

When a researcher wishes to include a categorical variable with more than two levels in a multiple regression prediction model, additional steps are needed to ensure that the results are interpretable. These steps include recoding the categorical variable into a number of separate, dichotomous variables. This recoding is called "dummy coding." In order for the rest of the chapter to make sense, some specific topics related to multiple regression will be reviewed at this time.

The Multiple Regression Model

Multiple regression is a linear transformation of the X variables such that the sum of squared deviations between the observed and predicted Y is minimized. The prediction of Y is accomplished by the following equation:

Y'i = b0 + b1X1i + b2X2i + ... + bkXki

The "b" values are called regression weights and are computed in a way that minimizes the sum of squared deviations.

That is, the regression weights minimize the sum of squared residuals, Σ(Yi - Y'i)².
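
For readers who want to see this computation outside SPSS, the following is a minimal sketch in Python with NumPy; the data values are made up for illustration. It fits the regression weights by least squares and reports the minimized sum of squared residuals.

```python
import numpy as np

# Illustrative data: two X variables and an observed Y (made-up values).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.0, 4.0, 7.0, 8.0, 11.0])

# Prepend a column of ones so that b0, the intercept, is estimated.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares chooses the b's that minimize sum((Y - Y')**2).
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_pred = X1 @ b
print("regression weights (b0, b1, b2):", b)
print("sum of squared residuals:", np.sum((y - y_pred) ** 2))
```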

Dichotomous Predictor Variables

Categorical variables with two levels may be directly entered as predictor or predicted variables in a multiple regression model. Their use in multiple regression is a straightforward extension of their use in simple linear regression. When entered as predictor variables, interpretation of regression weights depends upon how the variable is coded. If the dichotomous variable is coded as 0 and 1, the regression weight is added to the predicted value of Y for the group coded 1 but not for the group coded 0; whether this raises or lowers the prediction depends on the sign of the weight. If the dichotomous variable is coded as -1 and 1, then a positive regression weight is subtracted from the predicted value for the group coded as -1 and added for the group coded as 1; a negative regression weight reverses the addition and subtraction. Dichotomous variables can be included in hypothesis tests for R2 change like any other variable.
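
The effect of the two coding schemes can be verified numerically. The sketch below, in Python with made-up salary numbers for two equal-sized groups, fits the same dichotomous predictor coded both ways and prints the resulting intercepts and weights.

```python
import numpy as np

# Hypothetical salaries for two equal-sized groups (made-up numbers).
salary = np.array([40.0, 44.0, 46.0, 50.0, 52.0, 58.0])
group01 = np.array([0, 0, 0, 1, 1, 1])       # 0/1 coding
group_pm = np.where(group01 == 1, 1, -1)     # -1/1 coding

def fit(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b  # (intercept, regression weight)

b01 = fit(salary, group01.astype(float))
bpm = fit(salary, group_pm.astype(float))

# 0/1 coding: the intercept is the mean of the group coded 0; the
# weight is the difference between the two group means.
print("0/1 coding:  intercept=%.3f weight=%.3f" % tuple(b01))
# -1/1 coding (equal n): the intercept is the grand mean; the weight
# is half the difference between the two group means.
print("-1/1 coding: intercept=%.3f weight=%.3f" % tuple(bpm))
```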

Testing for Blocks of Variables

A block of variables can simultaneously be entered into a hierarchical regression analysis and tested as to whether, as a whole, they significantly increase R2, given the variables already entered into the regression equation. The degrees of freedom for the R2 change test correspond to the number of variables entered in the block of variables.
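
The R2 change test for a block can be written compactly. The following Python function is a sketch of the standard formula; the example values (R2 of .30 rising to .42, n = 30) are hypothetical.

```python
# Sketch of the F test for R2 change when a block of variables enters
# a hierarchical regression. Formula and df follow the standard test.
def r2_change_F(r2_full, r2_reduced, n, k_full, k_added):
    """F for the R2 increase when k_added variables join a model that
    ends with k_full predictors, fit on n cases.
    Numerator df = k_added; denominator df = n - k_full - 1."""
    num = (r2_full - r2_reduced) / k_added
    den = (1.0 - r2_full) / (n - k_full - 1)
    return num / den

# Hypothetical example: a block of 2 variables raises R2 from .30 to
# .42 in a model with 5 predictors total, fit on n = 30 cases.
F = r2_change_F(r2_full=0.42, r2_reduced=0.30, n=30, k_full=5, k_added=2)
print("F(2, 24) = %.3f" % F)
```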

Correlated and Uncorrelated Predictor Variables

Adding variables to a linear regression model will always increase the unadjusted R2 value. If the additional predictor variables are correlated with the predictor variables already in the model, then the combined results are difficult to predict. In some cases, the combined result will provide only a slightly better prediction, while in other cases, a much better prediction than expected will be the outcome of combining two correlated variables.

If the additional predictor variables are uncorrelated (r = 0.0) with the predictor variables already in the model, then the result of adding additional variables to the regression model is easy to predict. Namely, the R2 change will be equal to the squared correlation coefficient between the added variable and the predicted variable. In this case it makes no difference in what order the predictor variables are entered into the prediction model. For example, if X1 and X2 were uncorrelated (r12 = 0) and their squared correlations with Y were r1y² = .3 and r2y² = .4, then R2 for X1 and X2 together would equal .3 + .4 = .7. The value of R2 change for X2 given X1 was in the model would be .4, and the value of R2 change for X2 given no variable was in the model would also be .4. It would make no difference at what stage X2 was entered into the model; the value of R2 change would always be .4. Similarly, the R2 change value for X1 would always be .3. Because of this relationship, uncorrelated predictor variables are preferred, when possible.
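
This additivity can be checked with simulated data. In the Python sketch below the two predictors are generated independently, so they are uncorrelated in expectation, and the R2 for both together is approximately the sum of the two separate R2 values (only approximately, because sample correlations are never exactly zero).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Independently generated predictors are uncorrelated in expectation.
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 0.5 * x1 + 0.6 * x2 + rng.standard_normal(n)

def r2(y, *xs):
    """Unadjusted R2 from regressing y on the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# With (nearly) uncorrelated predictors, R2 for both together is
# approximately the sum of the two separate R2 values.
print("R2(x1) + R2(x2) =", round(r2(y, x1) + r2(y, x2), 4))
print("R2(x1, x2)      =", round(r2(y, x1, x2), 4))
```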


Example Data

The following is simulated data.

Faculty Salary Simulated Data
Faculty Salary Gender Rank Dept Years Merit
1 38 0 3 1 0 1.47
2 58 1 2 2 8 4.38
3 80 1 3 9 3.65
4 30 1 1 1 0 1.64
5 50 1 1 3 0 2.54
6 49 1 1 3 1 2.06
7 45 0 3 1 4 4.76
8 42 1 1 2 0 3.05
9 59 0 3 3 3 2.73
10 47 1 2 1 0 3.14
11 34 0 1 1 3 4.42
12 53 0 2 3 0 2.36
13 35 1 1 1 1 4.29
14 42 0 1 2 2 3.81
15 42 0 1 2 2 3.84
16 51 0 3 2 7 3.15
17 51 1 2 1 8 5.07
18 40 0 1 2 3 2.73
19 48 1 2 1 1 3.56
20 34 1 1 1 7 3.54
21 46 1 2 1 2 2.71
22 45 0 1 2 6 5.18
23 50 1 1 3 2 2.66
24 61 0 3 3 3 3.7
25 62 1 3 1 2 3.75
26 51 0 1 3 8 3.96
27 59 0 3 3 0 2.88
28 65 1 2 3 5 3.37
29 49 0 1 3 0 2.84
30 37 1 1 1 9 5.12

It is fairly clear that Gender could be directly entered into a regression model predicting Salary, because it is dichotomous. The problem is how to deal with the two categorical predictor variables with more than two levels (Rank and Dept).


Categorical Predictor Variables

Dummy Coding - making many variables out of one

Because categorical predictor variables cannot be entered directly into a regression model and be meaningfully interpreted, some other method of dealing with information of this type must be developed. In general, a categorical variable with k levels will be transformed into k-1 variables, each with two levels. For example, if a categorical variable had six levels, then five dichotomous variables could be constructed that would contain the same information as the single categorical variable. Dichotomous variables have the advantage that they can be directly entered into the regression model. The process of creating dichotomous variables from categorical variables is called dummy coding.

Depending upon how the dichotomous variables are constructed, additional information can be gleaned from the analysis. In addition, careful construction will result in uncorrelated dichotomous variables. As discussed earlier, these variables have the advantage of simplicity of interpretation and are preferred to correlated predictor variables.

Dummy Coding with three levels

The simplest case of dummy coding is when the categorical variable has three levels and is converted to two dichotomous variables. For example, Dept in the example data has three levels, 1=Family Studies, 2=Biology, and 3=Business. This variable could be dummy coded into two variables, one called FamilyS and one called Biology. If Dept = 1, then FamilyS would be coded with a 1 and Biology with a 0. If Dept=2, then FamilyS would be coded with a 0 and Biology would be coded with a 1. If Dept=3, then both FamilyS and Biology would be coded with a 0. The dummy coding is represented below.

Three-Level Variable Dummy Coded as 0 and 1
Dept (code) FamilyS Biology
Family Studies (1) 1 0
Biology (2) 0 1
Business (3) 0 0
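
Outside SPSS, this recoding takes only a few lines. The following Python/pandas sketch builds the FamilyS and Biology dummy variables from a Dept column coded 1, 2, and 3 (the five example rows are arbitrary).

```python
import pandas as pd

# Dept coded 1 = Family Studies, 2 = Biology, 3 = Business.
df = pd.DataFrame({"Dept": [1, 2, 3, 1, 3]})

# 0/1 dummy variables with Business (3) as the omitted reference level.
df["FamilyS"] = (df["Dept"] == 1).astype(int)
df["Biology"] = (df["Dept"] == 2).astype(int)
print(df)
```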

Using SPSS to Dummy Code Variables

The dummy coding can be done in SPSS using the Transform/Recode Into Different Variables... options. The Dept variable is the "Numeric Variable" that is going to be transformed. In this case the FamilyS variable is going to be created. The window on the screen should appear as follows:

Recoding a variable in SPSS into a different variable.

Clicking on the Change button and then on the Old and New Values... button will result in the following window:

Recoding a variable in SPSS into a different variable - second screen.

The Old Value is the level of the categorical variable to be changed, the New Value is the value on the transformed variable. In the example window above, a value of 3 on the Dept variable will be coded as a 0 on the FamilyS variable. The Add button must be pressed to add the recoding to the list. When all the recodings have been added, click on the Continue button and then the OK button.

The recoding of the Biology variable is accomplished in the same manner. A listing of the data is presented below.

The example data file with dummy coded variables.

The correlation matrix of the dummy variables and the Salary variable is presented below.

The correlation matrix of the non-orthogonal dummy coded variables.

Two things should be observed in the correlation matrix. The first is that the correlation between FamilyS and Biology is not zero; rather, it is -.474. The second is that the correlations between the Salary variable and the two dummy variables are different from zero. In particular, the correlation between FamilyS and Salary is significantly different from zero.

The results of predicting Salary from FamilyS and Biology using a multiple regression procedure are presented below. The first table enters FamilyS in the first block and Biology in the second. The second table reverses the order that the variables are entered into the regression equation. The model summary tables are presented below.

Predicting Salary using FamilyS in the first block and Biology in the second.

Predicting Salary using Biology in the first block and FamilyS in the second.

In the first table above, both FamilyS and Biology are significant. In the second, only FamilyS is statistically significant. Note that both orderings end up with the same value for multiple R (.604). It makes a difference, then, in what order the variables are entered into the regression equation in a hierarchical analysis.
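
This order dependence can be reproduced in a few lines of Python. In the sketch below, two mutually exclusive (and therefore correlated) dummy variables are generated along with a salary-like outcome; the R2 change for Biology differs depending on whether FamilyS is already in the model, while the final two-variable R2 is the same either way. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Mutually exclusive (hence correlated) dummy variables and a
# salary-like outcome; all values are simulated for illustration.
familys = (rng.random(n) < 0.4).astype(float)
biology = np.where(familys == 1, 0, rng.random(n) < 0.5).astype(float)
salary = 50 - 12 * familys - 9 * biology + rng.normal(0, 5, n)

def r2(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# R2 change for Biology depends on whether FamilyS is already entered:
print("Biology first:        ", round(r2(salary, biology), 4))
print("Biology after FamilyS:",
      round(r2(salary, familys, biology) - r2(salary, familys), 4))
# The final two-variable R2 is the same under either order:
print("Both variables:       ", round(r2(salary, familys, biology), 4))
```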

In the next tables, both FamilyS and Biology have been entered in the first block. The model summary table, ANOVA, and Coefficients tables are presented below.

The model summary table predicting salary using Biology and FamilyS.

The ANOVA table predicting salary using Biology and FamilyS.

The coefficients predicting salary using Biology and FamilyS.

The ANOVA and model summary tables contain basically redundant information in this case. The Coefficients table can be interpreted as the Biology department making 8.886 thousand dollars less in salary per year than the Business department, while the Family Studies department makes 12.350 thousand dollars less than the Business department. Note that the "Sig." levels in the "Coefficients" table are the same as the significance levels of the model summary tables presented earlier when each of the dummy coded variables is entered into the regression equation last.

Similarity of Regression analysis and ANOVA

The results of the preceding analysis can be compared to the results of using the ANOVA procedure in SPSS with Salary as the dependent measure and Dept as the independent. The following table presents the table of means and ANOVA table.

Means and standard deviations of Salary broken down by Dept.

The ANOVA table of Salary broken down by Dept.

Note first that the ANOVA tables produced using the ANOVA command and the Linear Regression command are identical. ANOVA is a special case of linear regression when the variables have been dummy coded. The second notable comparison of the tables involves the regression weights and the actual differences between the means. Note that the regression weight for FamilyS in the regression procedure is -12.350 and the difference between the means of the Family Studies department (42.25) and the Business department (54.60) is -12.350.
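
The equivalence can be demonstrated directly: the F from a one-way ANOVA equals the F from a regression on the dummy coded variables. The Python sketch below uses three illustrative groups of salaries (loosely patterned on the chapter's data, not the exact values).

```python
import numpy as np
from scipy import stats

# Illustrative salaries for three departments (unequal n; values only
# loosely patterned on the chapter's data).
fam = np.array([38.0, 30, 45, 47, 34, 35])
bio = np.array([58.0, 42, 42, 51, 40, 45, 34])
bus = np.array([50.0, 49, 59, 53, 61, 51, 59, 65, 49])

# One-way ANOVA F:
F_anova, p = stats.f_oneway(fam, bio, bus)

# Identical F from a regression on two 0/1 dummy variables:
y = np.concatenate([fam, bio, bus])
familys = np.r_[np.ones(len(fam)), np.zeros(len(bio) + len(bus))]
biology = np.r_[np.zeros(len(fam)), np.ones(len(bio)), np.zeros(len(bus))]
X = np.column_stack([np.ones(len(y)), familys, biology])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ b) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
k, n = 2, len(y)
F_reg = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
print("ANOVA F = %.4f, regression F = %.4f" % (F_anova, F_reg))
```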


Dummy Coding into Independent Variables

Selection of an appropriate set of dummy codes will result in new variables that are uncorrelated or independent of each other. In the case when the categorical variable has three levels, this can be accomplished by creating a new variable where one level of the categorical variable is assigned the value of -2 and the other two levels are assigned the value of 1. The signs are arbitrary and may be reversed; that is, values of 2 and -1 would work equally well. The second dummy coded variable gives the level coded -2 on the first variable a value of 0 and recodes the other two levels as 1 and -1. In all cases the sum of the codes for a dummy coded variable will be zero. Trust me, this is actually much easier than it sounds.

Interpretation is straightforward. Each of the new dummy coded variables, called a contrast, compares levels coded with a positive number to levels coded with a negative number. Levels coded with a zero are not included in the interpretation.

For example, Dept in the example data has three levels, 1=Family Studies, 2=Biology, and 3=Business. This variable could be dummy coded into two variables, one called Business (comparing the Business Department with the other two departments) and one called FSvsBio (for Family Studies versus Biology). The Business contrast would create a variable where all members of the Business Department would be given a value of -2 and all members of the other two departments would be given a value of 1. The FSvsBio contrast would assign a value of 0 to members of the Business Department, 1 divided by the number of members of the Family Studies Department to members of the Family Studies Department, and -1 divided by the number of members of the Biology Department to members of the Biology Department. The FSvsBio variable could instead be coded as 1 and -1 for Family Studies and Biology respectively, but the recoded variable would no longer be uncorrelated with the first dummy coded variable (Business). In most practical applications, it makes little difference whether the variables are correlated or not, so the simpler 1 and -1 coding is generally preferred. The contrasts are summarized in the following table.

Orthogonal dummy coded variables.
Dept (code) Business FSvsBio
Family Studies (1) 1 1/N1 = 1/12 = .0833
Biology (2) 1 -1/N2 = -1/7 = -.1429
Business (3) -2 0
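
The claim that these two contrasts are uncorrelated, even with unequal group sizes, can be verified numerically. The Python sketch below builds both contrasts for group sizes of 12, 7, and 10 (the Business count of 10 is assumed here for illustration) and prints their correlation, which is zero within floating point error.

```python
import numpy as np

# Dept group sizes matching the table: N1 = 12 Family Studies,
# N2 = 7 Biology, plus 10 Business members (assumed for illustration).
dept = np.array([1] * 12 + [2] * 7 + [3] * 10)
n1 = np.sum(dept == 1)
n2 = np.sum(dept == 2)

# Business contrast: Business (-2) vs. the other two departments (1).
business = np.where(dept == 3, -2.0, 1.0)
# FSvsBio contrast, weighted by 1/n within each department so that it
# stays uncorrelated with Business despite the unequal group sizes.
fs_vs_bio = np.select([dept == 1, dept == 2],
                      [1.0 / n1, -1.0 / n2], default=0.0)

# The correlation between the two contrasts is zero within rounding:
print(round(float(np.corrcoef(business, fs_vs_bio)[0, 1]), 10))
```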

The data matrix with the dummy coded variables would appear as follows.

The example data matrix of orthogonal dummy coded variables.

The correlation matrix containing the two contrasts and the Salary variable is presented below.

The correlation matrix of the orthogonal dummy coded variables.

Note that the correlation coefficient between the two contrasts is zero. The correlation between the Business contrast and Salary is -.585 with a squared correlation coefficient of .342. This correlation coefficient has a significance level of .001. The correlation coefficient between the FSvsBio contrast and Salary is -.150 with a squared value of .023.

In this case entering Business or FSvsBio first makes no difference in the results of the regression analysis.

Predicting Salary using Business in the first block and FSvsBio in the second.

Predicting Salary using FSvsBio in the first block and Business in the second.

Entering both contrasts simultaneously into the regression equation produces the following ANOVA table.

The ANOVA table predicting Salary using FSvsBio and Business.

Note that this table is identical to the two ANOVA tables presented in the previous section. It may be concluded that it does not make a difference what set of contrasts is selected when only the overall test of significance is desired. It does make a difference how contrasts are selected, however, if it is desired to make a meaningful interpretation of each contrast.

The coefficient table for the simultaneous entry of both contrasts is presented below.

The coefficients table predicting Salary using Business and FSvsBio.

Note that the "Sig." level is identical to the value when each contrast was entered last into the regression model. In this case the Business contrast was significant and the FSvsBio contrast was not. The interpretation of these results would be that the Business Department was paid significantly more than the Family Studies and Biology Departments, but that no significant differences in salary were found between the Family Studies and Biology Departments.

By carefully selecting the set of contrasts to be used in the regression with categorical variables, it is possible to construct tests of specific hypotheses. The hypotheses to be tested are generated by the theory used when designing the study.


Categorical Predictor Variables with Six Levels

If a categorical variable had six levels, five dummy coded contrasts would be necessary to use the categorical variable in a regression analysis. For example, suppose that a researcher at a headache care center did a study with six groups of four patients each (N is being deliberately kept small). The dependent measure is subjective experience of pain. The six groups consisted of six different treatment conditions.

The six treatment conditions of the second example.
Group Treatment
1 None
2 Placebo
3 Psychotherapy
4 Acupuncture
5 Drug 1
6 Drug 2

An independent contrast is a contrast that is not a linear combination of the other contrasts in the set. Any set of independent contrasts would work equally well if the end result is the simultaneous test of all five contrasts, as in an ANOVA. One of the many possible examples is presented below.

Non-orthogonal dummy codes for six groups.
Treatment Group C1 C2 C3 C4 C5
None 1 0 0 0 0 0
Placebo 2 1 0 0 0 0
Psychotherapy 3 0 1 0 0 0
Acupuncture 4 0 0 1 0 0
Drug 1 5 0 0 0 1 0
Drug 2 6 0 0 0 0 1

Application of this dummy coding in a regression model entering all contrasts in a single block would result in an ANOVA table similar to the one obtained using Means, ANOVA, or General Linear Model programs in SPSS.

This solution would not be ideal, however, because considerable additional information is available by setting the contrasts to test specific hypotheses. The levels of the categorical variable generally dictate the structure of the contrasts. In the example study, it makes sense to contrast the two control groups (1 and 2) with the four experimental groups (3, 4, 5, and 6). Any two numbers would work, one assigned to groups 1 and 2 and the other assigned to the remaining four groups, but it is conventional to have the sum of the contrast codes equal zero. One contrast that meets this criterion would be (-2, -2, 1, 1, 1, 1).

Generally it is easiest to set up contrasts within subgroups of the first contrast. For example, a second contrast might test whether there are differences between the two control groups. This contrast would appear as (1, -1, 0, 0, 0, 0). A third contrast might compare the non-drug vs. drug treatment groups, groups 3 and 4 vs. groups 5 and 6 (0, 0, 1, 1, -1, -1). As can be seen, this would be a contrast within the experimental treatment groups. Within the non-drug treatments, a contrast comparing Group 3 with Group 4 might be appropriate (0, 0, 1, -1, 0, 0). Within the drug treatment conditions, a contrast comparing the two drug treatments would be the last contrast (0, 0, 0, 0, 1, -1). Combined, the contrasts are given in the following table.

Orthogonal dummy codes for six groups.
Treatment Group C1 C2 C3 C4 C5
None 1 -2 1 0 0 0
Placebo 2 -2 -1 0 0 0
Psychotherapy 3 1 0 1 1 0
Acupuncture 4 1 0 1 -1 0
Drug 1 5 1 0 -1 0 1
Drug 2 6 1 0 -1 0 -1
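
A quick way to confirm that a set of contrasts is orthogonal (given equal group sizes) is to check that every pair of contrast vectors has a zero dot product. The Python sketch below encodes the table above and performs that check, then expands the codes to one row per subject as they would enter a regression.

```python
import numpy as np

# Rows = contrasts C1..C5; columns = groups 1..6 in the order
# None, Placebo, Psychotherapy, Acupuncture, Drug 1, Drug 2.
C = np.array([
    [-2, -2,  1,  1,  1,  1],   # C1: controls vs. treatments
    [ 1, -1,  0,  0,  0,  0],   # C2: None vs. Placebo
    [ 0,  0,  1,  1, -1, -1],   # C3: non-drug vs. drug treatments
    [ 0,  0,  1, -1,  0,  0],   # C4: Psychotherapy vs. Acupuncture
    [ 0,  0,  0,  0,  1, -1],   # C5: Drug 1 vs. Drug 2
])

# Every row sums to zero, and every pair of rows has a zero dot
# product, so the contrasts are orthogonal with equal group sizes.
print(C.sum(axis=1))   # all zeros
print(C @ C.T)         # off-diagonal entries are all zero

# Expand to one row per subject (4 per group) as regression predictors:
group = np.repeat(np.arange(6), 4)
X = C.T[group]         # 24 x 5 matrix of dummy coded contrasts
print(np.corrcoef(X, rowvar=False).round(10))  # off-diagonals are zero
```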

The following table presents example data and dummy coded contrasts for this hypothetical study.

The data matrix for orthogonal contrasts of six groups.

The correlation matrix of the five contrasts and the pain variable is presented below.

The correlation matrix for orthogonal contrasts of six groups.

Note that the correlation coefficients between the five contrasts are all zero. This occurs because all groups have an equal number of subjects.

Using pain as the dependent variable and the five contrasts as the independent variables, the regression results tables entering all variables in block 1 are presented below.

The model summary for orthogonal contrasts of six groups.

The ANOVA table for orthogonal contrasts of six groups.

The coefficients table for orthogonal contrasts of six groups.

Of major interest is the "Sig." column of the "Coefficients" table. Note that all contrasts are statistically significant except C5. This can be interpreted as: (1) the treatment conditions were more effective than the control conditions; (2) the two control conditions significantly differed from one another, with placebo more effective than no treatment; (3) the drug treatments were more effective in reducing pain than the non-drug treatments; (4) Acupuncture was significantly more effective than Psychotherapy; and (5) the two drug treatments were not significantly different from one another.

The output from the "General Linear Model, Simple Factorial" program in SPSS is presented below.

The data for orthogonal contrasts of six groups analyzed using an ANOVA procedure.

Note that it is, for all practical purposes, identical to the ANOVA table produced using the multiple regression program with the dummy coded contrasts. In effect, what the General Linear Model program does is automatically select a set of contrasts and then perform a regression analysis with those contrasts. The General Linear Model program also allows the user to specify a special set of contrasts so that an analysis like the one done with dummy coding of contrasts in multiple regression might be performed. It is left for the reader to explore this ability in SPSS.


Combinations of Categorical Predictor Variables

In the original example data set for this chapter there were three obvious categorical variables: Gender, Rank, and Dept. Gender could be directly entered into the regression model. After dummy coding into two contrasts each, Rank and Dept could also be entered into the regression model. Difficulties arise, however, when combinations of these categorical variables must be considered. For example, consider Gender and Dept. Rather than two groups and three groups, this combination of categorical variables must be considered as six groups: Male Family Studies, Female Family Studies, Male Biology, Female Biology, Male Business, and Female Business. Dummy coding these data would require five dummy coded contrasts. Three exist already, one for Gender and two for Dept, but the two additional contrasts are so far unaccounted for. They will be the focus of the next topic, interaction effects.

Equal Sample Size

Because everything works out much more cleanly when sample sizes are equal, this case will be presented first. The example data set has been reduced to twelve subjects, two for each combination of Gender and Dept. The reduced data set is presented below.

Data matrix for combinations of categorical predictor variables.
Faculty Salary Gender Dept
7 45 0 1
11 34 0 1
14 42 0 2
15 42 0 2
9 59 0 3
12 53 0 3
4 30 1 1
10 47 1 1
8 42 1 2
2 58 1 2
5 50 1 3
6 49 1 3

The levels of Gender and Dept will now be combined to produce six groups.

Data matrix for combinations of categorical predictor variables as a single group.
Salary Gender Dept Group
45 0 1 1
34 0 1 1
42 0 2 2
42 0 2 2
59 0 3 3
53 0 3 3
30 1 1 4
47 1 1 4
42 1 2 5
58 1 2 5
50 1 3 6
49 1 3 6

The situation is now analogous to the earlier case when the categorical variable had six levels.

Main Effects

A categorical variable with six levels can be dummy coded into five contrasts. The first three contrasts have already been discussed. The first of these contrasts will compare males with females and will comprise the Gender Main Effect. The next two will compare the salaries of the three departments over levels of gender and will be called the Department Main Effect. The dummy codes for these main effects are presented below.

Orthogonal contrasts for main effects for data matrix for combinations of categorical predictor.
Salary Group C1 (Gender Main Effect) C2 C3 (Department Main Effect)
45 1 1 1 1
34 1 1 1 1
42 2 1 1 -1
42 2 1 1 -1
59 3 1 -2 0
53 3 1 -2 0
30 4 -1 1 1
47 4 -1 1 1
42 5 -1 1 -1
58 5 -1 1 -1
50 6 -1 -2 0
49 6 -1 -2 0

This is basically the same coding as discussed earlier, except it is simplified because of the equal number of subjects in each cell. It will later be demonstrated that the correlation coefficients between these dummy coded variables are zero.

Interaction Effects

Two additional dummy coded variables are needed to fully account for the six-level categorical variable. These contrasts will comprise the Interaction Effect. In this case the easiest way to find the needed contrasts is to multiply the dummy coded contrast for Gender by each of the dummy coded contrasts for Department. This has the result of changing the sign of the department contrasts for one gender but not the other. The results of this operation appear below.

Orthogonal contrasts for main and interaction effects for data matrix for combinations of categorical predictor.
Salary Group C1 (Gender) C2 C3 (Department) C4 C5 (Interaction)
45 1 1 1 1 1 1
34 1 1 1 1 1 1
42 2 1 1 -1 1 -1
42 2 1 1 -1 1 -1
59 3 1 -2 0 -2 0
53 3 1 -2 0 -2 0
30 4 -1 1 1 -1 -1
47 4 -1 1 1 -1 -1
42 5 -1 1 -1 -1 1
58 5 -1 1 -1 -1 1
50 6 -1 -2 0 2 0
49 6 -1 -2 0 2 0
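
The multiplication rule is easy to express in code. The Python sketch below holds one entry per group and forms C4 and C5 as element-wise products of the Gender contrast with the two Department contrasts, reproducing the codes in the table above.

```python
import numpy as np

# One entry per group (1..6), following the chapter's table.
c1 = np.array([ 1,  1,  1, -1, -1, -1])   # Gender main effect
c2 = np.array([ 1,  1, -2,  1,  1, -2])   # Dept: Business vs. others
c3 = np.array([ 1, -1,  0,  1, -1,  0])   # Dept: FS vs. Biology

# Interaction contrasts are element-wise products of the main effects.
c4 = c1 * c2
c5 = c1 * c3
print("C4:", c4)   # [ 1  1 -2 -1 -1  2]
print("C5:", c5)   # [ 1 -1  0 -1  1  0]
```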

The correlation matrix for this data set is presented below.

The correlation matrix for orthogonal contrasts of six groups with interaction and main effects.

Note that the contrasts all have a correlation coefficient of zero among themselves. The contrasts will be entered into the regression equation predicting salary in three blocks. The first block will contain C1, the second will contain C2 and C3, while the third will contain C4 and C5. The results of this analysis are presented below.

Model summary for orthogonal contrasts of main and interaction effects with variables entered in three blocks of C1, C2 and C3, and C4 and C5.

Entering the contrasts in the opposite order has no effect on R Square Change.

Model summary for orthogonal contrasts of main and interaction effects with variables entered in three blocks of C4 and C5, C2 and C3, and C1.

The values for "F Change" and "Sig. F Change" are different, however, because different error terms are employed in each case. In this subset of the data, none of the contrasts are significant. The interpretation of main effects and interaction effects will be the topic of the next chapter.

Unequal Sample Size

Equal sample size is seldom achieved in the real world, even in the best-designed experiments. Unequal sample size makes the effects no longer independent. This implies that it makes a difference in hypothesis testing whether an effect is entered into the model first, in the middle, or last.

The same dummy coding that was applied to equal sample sizes will now be applied to the original data with unequal sample sizes. The simplest way to do this is to recode Gender into C1, Dept into C2 and C3, and compute C4 and C5 by multiplying the corresponding contrasts together. For example, C4 could be created by multiplying C1 * C2 and C5 could be created by multiplying C1 * C3. The data and dummy coded contrasts appear below.

Data and contrasts for main and interaction effects for data matrix for combinations of categorical predictor.

The correlation matrix of the contrasts is presented below.

Correlation matrix for main and interaction effects for data matrix for combinations of categorical predictor.

Note that the correlation coefficients between the contrasts are not zero. This has the effect of changing the value of R2 Change for a term depending upon when that term was entered into the model. This is illustrated by entering the two contrasts associated with Dept (C2 and C3) first, second, and last.

Main Effects of Dept Entered First

Model summary for contrasts of main and interaction effects with variables entered in three blocks of C2 and C3, C1, and C4 and C5.

Main Effects of Dept Entered Second

There are two different ways in which the main effect of Dept may be entered second in the regression model. The first is after Gender and is presented below.

Model summary for contrasts of main and interaction effects with variables entered in three blocks of C1, C2 and C3, and C4 and C5.

As can be seen, the value of R2 change for adding C2 and C3 changes only slightly, from .379 to .376. A slightly greater change in the R2 change value is observed if the interaction contrasts (C4 and C5) are entered before the main effect of Dept.

Model summary for contrasts of main and interaction effects with variables entered in three blocks of C4 and C5, C2 and C3, and C1.

Note that the value of R2 change is greater for Gender (C1) if it is entered last, rather than first.

Main Effects of Dept Entered Third

Model summary for contrasts of main and interaction effects with variables entered in three blocks of C4 and C5, C1, and C2 and C3.

Note that the value of R2 change for Dept changes only slightly depending upon when the effect was entered into the model. The pattern of results of the significance tests would not change.

Main Effect of Gender Given Rank, Dept, Gender X Rank, Gender X Dept, Years, Merit

The dummy coded contrasts can be used like any other variables in a multiple regression analysis. In order to find the significance of the effect of Gender given Rank, Dept, Gender X Rank, Gender X Dept, Years, and Merit, the Rank and Gender X Rank effects must be created as dummy coded contrasts. In the following data file the Rank main effect consists of two contrasts: C2a contrasting Full professors with Assistant and Associate professors and C3a contrasting Assistant with Associate professors. The Gender X Rank interaction contrasts (C4a and C5a) are constructed by multiplying the Gender contrast (C1) times the two contrasts for the main effect for Rank.

Contrast coding for rank, gender, and rank X gender interaction.
Gender Rank C1 C2a C3a C4a C5a
0 1 -1 1 1 -1 -1
0 2 -1 1 -1 -1 1
0 3 -1 -2 0 2 0
1 1 1 1 1 1 1
1 2 1 1 -1 1 -1
1 3 1 -2 0 -2 0

The additional dummy coded variables are added to the data file shown below.

Data matrix with contrast coding for rank, gender, and rank X gender interaction.

Salary is predicted in six blocks (only two are really needed) in the following multiple regression analysis. In a simplified analysis, the first block would contain all variables except Gender (C1) and the second would contain only Gender (C1).

Model summary with contrast coding for rank, gender, and rank X gender interaction.

As can be seen, the R2 change for Gender has increased to a value of .120, which is significant. The value of multiple R is not really 1.000, but it is very high, close to 1.000. For that reason the error variance is extremely small, resulting in significant effects. This illustrates the problem of fitting too few data points with too many parameters.
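
The two-block version of this test can be sketched as follows in Python. The data here are randomly generated stand-ins (six predictor columns for the other effects plus a Gender contrast), not the chapter's faculty data; the point is only the mechanics of testing Gender entered last.

```python
import numpy as np

def r2(y, X):
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Randomly generated stand-ins, not the chapter's faculty data.
rng = np.random.default_rng(2)
n = 30
others = rng.standard_normal((n, 6))   # the six non-Gender predictors
gender = rng.choice([-1.0, 1.0], n)    # the Gender contrast, C1
salary = others @ rng.normal(2, 1, 6) + 3 * gender + rng.normal(0, 4, n)

# Block 1: everything except Gender. Block 2: Gender entered last.
r2_reduced = r2(salary, others)
r2_full = r2(salary, np.column_stack([others, gender]))
r2_change = r2_full - r2_reduced

# F test for the change; 7 predictors in the full model.
F = r2_change / ((1 - r2_full) / (n - 7 - 1))
print("R2 change for Gender entered last: %.3f, F(1, %d) = %.2f"
      % (r2_change, n - 8, F))
```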

If all the effects mentioned above are entered into the model in a single block, the coefficients table appears as follows.

Coefficients table with contrast coding for rank, gender, and rank X gender interaction.

As has been described earlier, the "Sig." column is the significance level of that variable if it is entered last into the regression model. Since t² = F, it is noted that 77.205² is equal to 5960.619, within rounding error. In this case, every variable except C4 and Years is statistically significant.

The alert reader has probably noted that other interaction terms could be created and entered into the regression model. For example, four dummy coded contrasts could be created such that a Rank X Dept interaction could be found. Multiplying this by the Gender contrast (C1) would result in a three-way Gender X Rank X Dept interaction.


ANOVA using General Linear Model in SPSS

Although the dummy coding of variables in multiple regression results in considerable flexibility in the analysis of categorical variables, it can also be tedious to program. For this reason most statistical packages have made a program available that automatically creates dummy coded variables and performs the appropriate statistical analysis. In most cases the user is unaware of the calculations being performed in the computer program. This is the case with the General Linear Model program in SPSS.

This program is selected in SPSS by Analyze/General Linear Model/GLM - General Factorial.... To perform the Gender by Department analysis discussed earlier in this section, enter Salary as the dependent measure and Gender and Dept as fixed factors. The screen should appear as follows.

SPSS user interface for factorial ANOVA.

Click OK and the results are as follows.

SPSS output for the factorial ANOVA.

Note that the "F" column and "Sig." column is identical to the results of the R2 change analysis presented earlier in this chapter if each of the effects is entered last. This is the meaning of the default "Type III Sum of Squares."

The interpretation of "effects," the result of the dummy coding of categorical variables, is the subject of the next chapter.


Summary

This chapter discussed how categorical variables with more than two levels could be used in a multiple regression prediction model. The procedure is called dummy coding and involves creating a number of dichotomous categorical variables from a single categorical variable with more than two levels. The text showed how any number of different coding systems would result in similar overall statistical decisions. It was also argued that some coding systems contain greater information about specific statistical decisions and are to be preferred over coding systems that provide less information. The usefulness of one type of coding system, that of main and interaction effects, was demonstrated when there were combinations of categorical variables. The similarity of this system of dummy coding and multifactor ANOVA was demonstrated.