A relational database is a much more efficient means to store and maintain information that the flat two-dimensional tables of a statistical package. The latest editions of SPSS allow the user to easily construct SPSS data files from many relational databases. This chapter will explore relational databases and then show how to use them to construct SPSS data files.
An education researcher has been given a grant to study student attitudes toward learning as a function of student attributes, teacher attributes and teaching environment. Student attributes include such variables as general intelligence, socio-economic status, parental involvement, motivation, absences, sex, age, temperament, and involvement with athletics. Teacher attributes include variables such as age, sex, marital status, educational level, experience, motivation, temperament, and leadership characteristics. Teaching environment includes variables such as class size, lighting, comfort, available technology, and age of building.
The researcher gives an attitude survey consisting of fifteen questions that are answered on a scale of 1=strongly disagree to 5=strongly agree. The items are to be combined into three scales, one of nine items, one of four items, and one of 2 items. The researcher wants to construct a prediction model using multivariate statistical methods to predict student attitudes on the three scales from the various student, teacher, and environmental attributes.
The researcher collects data from every student in every class in five schools. If there is an average of 500 students at each school with 6 classes each, the data file will consist of 5 * 500 * 6 = 15,000 entries or rows in the data file. If there were twenty-five attribute variables and fifteen survey variables, the total number of entries in the data matrix would be 15,000*(25+15)=600,000. It is going to take a small army of data collectors and data entry operators to create this data file.
The researcher can automate some of the data entry. The survey items can be collected on computer forms that can be read directly into the computer. The students could also enter a code for the teacher and the classroom on the form. It might be difficult or impossible, however, for the student to enter complete information about the teacher, classroom, and personal variables. How many students have an accurate memory of their absences? How many know their general intelligence? This information might exist in other data sources, but the student is not going to be able to enter this on the computer form.
The researcher recognizes that there is a great deal of duplication in the additional information that must be entered in the variables in to complete the data file. Each classroom, for example, should have the same information entered as environmental variables. The same should happen for each teacher. Because the teachers move around to different classrooms in the schools, the teacher and environmental variables are not completely confounded. Data entry could be further automated using "cut and paste" methods, but these can be tedious and better methods exist to construct this type of data file.
The end result of the process is a data table that appears as follows:
The problem is that the SPSS data file is a flat, two-dimensional, table. Each row is an observation and each column is a variable. This works fine for analyzing data, but it is an inefficient way of storing redundant information.
A relational database is a series of two-dimensional tables that are linked by an index or key variable. The key allows complex, redundant, two-dimensional tables to be constructed on-the-fly. This system allows the researcher to construct four non-redundant tables, each containing information about one aspect of the study. The tables will be combined into a single table and used as input into the statistical package.
Any number of different relational database programs would work for the researcher. Microsoft AccessTM has been selected because it is widely available. It is not a "heavy-duty" database in that you might not want to set up a critical e-commerce business using it, but it is more than adequate for our purpose.
Often the relational database that contains the information will have been constructed and maintained by a third party. The job of the researcher is to access and join the tables within the statistical package. Creating a simple relational database in Access is fairly easy and may provide insight into how relational databases function, so before learning how to access a database within a statistical package, a demonstration database will be constructed.
Open the Microsoft Access program and when the screen shown in the following figure appears, select Blank Access database and then click OK.
On the next screen, select the folder to save the database using the Save in: box. Name the database (in the File name: box) Schools.
The database is now created, but contains no information. Four database tables will need to be created, one for each specific type of information. The first table will contain information about the teachers. After you clicked Create in the last section, the dialog box shown in the following figure should be open on your screen.
Double-click on Create table in Design View. When the design view table appears, enter the information shown in the following figure:
Note: When your cursor is placed in the Data Type column, click the small arrow at the right of the cell to view a dropdown list of possible data types. Select Number for this exercise.
When you have entered all of the Table1 information, right-click the TeacherID field name and select Primary Key from the pop-up menu (shown in the following figure) to make this field the primary key.
When finished, change from Design View" to Datasheet View by clicking the View icon in the left-hand top of the screen, or by selecting Datasheet View from the View menu.
The computer will ask if you want to save the table before changing views.
Click Yes, and a Save As dialog box (shown in the following figure) appears. Type Teacher in the Table Name: box and click OK.
Information may be added to the table in the Datasheet View. A new row is created when information is entered in any of the cells. The example table contains information for five teachers, shown in the following figure. Enter this information into your Teacher table.
When you've entered all the information to the Teacher table, close the table by clicking the X in the upper right-hand corner of the Teacher:Table screen. The Schools:Database screen reappears, and another table can be created.
Since our example database requires four tables, the previous procedure must be repeated three more times. First, create the Student table. The Design View is shown in the following figure. Enter the information from it into your new table:
The Student table is constructed with the data shown in the following figure (note that the StudentID would most likely be the student's Social Security number- the examples are not reasonable). Enter the information in your Student table.
The next repetition creates the Classroom table. Enter the information shown in the next two figures into your next table.
Finally, you need to create the Questionnaire table. Most likely this table would be populated using some kind of automation, such as scantron sheets or by entering information on forms from a Web page. This table is the main table and contains references to the key variable on the other tables. For example, an entry of "2" in the TeacherID variable refers to the teacher identified as "2" in the TeacherID variable in the Teacher table. The variables do not need to have similar names in order to refer to each other, only similar values. An entry called formID is used as the primary key because all other variables are not unique, that is, the same StudentID may (and probably will) appear more than once in this column. The Design View of the Questionnaire table appears in the following figure. Enter this information for your Questionnaire table.
The datasheet view of the Questionnaire table is shown in the following figure. Enter the date in your table.
At this point the relational database is complete for the purposes of generating an SPSS data file. The database can either be closed or remain open for the rest of the exercise.
Rather than creating or reading an existing SPSS data file, the researcher is going to create a data file by combining the tables in the relational database. To open a relational database as a data file, select File/Open Database/ New Query, as the following figure shows.
The Database Wizard screen, shown in the following figure, should appear.
After the database is added, it will appear in the list of possible data sources. The first time, however, requires that the database be added as a data source.
In the Database Wizard's opening screen, click Add Data Source. The ODBC Data Source Administrator dialog box should appear. Click on the System DSN tab, and the System DSN page of the dialog box opens. It will look similar to the one in this figure.
Click the Add... button, and the Create New Data Source dialog box appears. First, select a driver for your data source: since the relational database was created using Microsoft Access, choose Microsoft Access Driver from the list of names, as shown in this figure:
Click the Finish button, and the ODBC Microsoft Access Setup dialog box appears (see the following figure).
The name that is entered in the Data Source Name box will appear in the list of possible data sources. Our researcher entered SchoolAttitudes in that box, and then entered Attitudes toward School in the Description box. The Description is optional, but it is always a good idea to document thoroughly. The next step is to select the database; click the Select button to proceed.
The Select Database dialog box opens. In the Directories (c:\databases) box, select the folder in which you saved your database. Doubleclick on the name of the database-Schools.mdb, in this case. The screen should look something like the following figure:
Click OK. The Database Wizard welcome screen reappears, and your data source (SchoolAttitudes) should now be added to the list, as this figure shows.
Remember these steps need only be done once.
Select SchoolAttitudes from the data source list, and click the Next button.
In Step 2 of the Database Wizard, click and drag all four tables from the Available Tables box on the left to the Retrieve Fields In This Order box on the right. You will use them all in the construction of the data file. The resulting screen should appear something like the following figure:
Click Next, and Step 3 of the Database Wizard appears. The relationships between the tables need to be specified. Four boxes corresponding to the four tables in the database should appear (see the following figure). The variables in each table are listed. To specify a relationship, link two tables by dragging one variable name in one table to the corresponding variable name in the other. In this case, each of three variable names in the Questionnaire table is linked to similar variable name in one of the other three tables:
Note that ClassroomID is linked with Classroom, demonstrating that the variables do not necessarily have to share the same name.
Click the Finish button, and the data file magically appears in SPSS, as shown in the following figure.
This data file can be cleaned up by deleting unneeded variables, adding documentation in value and variable labels, and changing variable names. In all respects it can be treated like any other SPSS data file.
A relational database has many advantages over flat data tables when dealing with complex data. In this chapter you constructed a simple database and then combined it into a single data table in SPSS. The purpose of this exercise was to illustrate the power of using relational databases as a source of data for statistical analysis. You should be aware, however, that the power of these techniques has only been slightly revealed. When large databases from different sources are combined, powerful results are possible.