This handbook is about how to plan, conduct, analyze, and write a statistics project. The end goal is a cohesive, understandable paper using statistics as a tool to convey technical information to the reader. Designed for both the undergraduate and graduate student, it is intended as supplementary material for a statistics course. A knowledge of this material alone would not be enough to conduct and write a statistics project, just as knowledge of the standard introductory text is often not enough to use statistics.
Almost everyone in the social sciences will probably collect and present statistical data at some point in their lives. Money and jobs sometimes depend on the ability to communicate a community's need for social services, such as drug education, medical care for the aged, or guidance counselors in primary schools. At other times the evaluation of existing programs provides important information to aid in making decisions about the future direction of the program.
All data has a story to tell. In doing the project, the student must first find the story in the data and then figure out how best to tell it to others. The discovery process may involve much analysis that is never presented in the final paper. The storytelling may lead down dead-end paths until the story is presented in such a way as to be clear to the reader. Writing a good technical paper using statistical information is not an easy task, but it becomes easier with practice.
The main purpose of the project is to learn how to plan, collect, organize, and write using statistical information. In actually attempting to write using statistics, a potential writer becomes acutely aware of the problems that all statistical writers must face. Problems such as missing data, measurement issues, accurate reporting and recording, depth of analysis, and finding cognitively simple ways of presenting complex results must be faced. The writer becomes a more critical consumer of other statistical information by attempting to do a project.
The project is not intended to be an experimental study involving the manipulation of variables by the experimenter, but rather an observational or field study. As such it would often be performed before an experimental study, just as this course generally precedes a course in experimental methods.
Because not many statistical manipulations are appropriate for extremely small samples, a minimum of twenty records (rows of data) is required. Earlier editions of the handbook specified an upper limit of 100 records in order to allow students to focus on writing the report and not on collecting data. Since large data sets are now easily available and current computers have no difficulty in dealing with them, the upper limit is no longer necessary. The only requirement is that the project is submitted by the various due dates and completed by the end of the semester. The main focus should be on finding and telling the story of the data and not on entering data. In general, the more data the better; although twenty records is the minimum, at least 30-40 records are necessary for reasonable analysis.
Between ten and twenty variables or measures per record will define the limits of an acceptable project. The measures may be of any form: nominal, ordinal, interval, or ratio. Any fewer than ten measures generally will result in a project without enough substance, while more than twenty measures become unwieldy and may require more time to analyze than is available to students.
The project is characterized by six major steps: (1) selection of a central idea and population of interest, (2) selecting and measuring the variables of interest, (3) sampling from the population, (4) collection of data, (5) data analysis - finding the story, and (6) writing the results in an understandable form - telling the story. The first four steps will be presented in the remaining part of this chapter. The last two steps are presented in separate chapters.
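The size guidelines above are easy to check before a proposal is submitted. The following is a minimal sketch, assuming the data are held as a list of dictionaries with one dictionary per record; the function name `check_project_size` and the example data are made up for illustration and are not part of the handbook.

```python
# Hypothetical helper: compares a data set with the size guidelines above.
MIN_RECORDS = 20      # minimum number of records (rows of data)
MIN_VARIABLES = 10    # fewer measures usually means too little substance
MAX_VARIABLES = 20    # more measures tend to become unwieldy


def check_project_size(records):
    """Return a list of warnings for a data set given as a list of dicts."""
    warnings = []
    n_records = len(records)

    # Count every distinct variable name that appears in any record.
    variables = set()
    for row in records:
        variables.update(row.keys())
    n_variables = len(variables)

    if n_records < MIN_RECORDS:
        warnings.append(f"only {n_records} records; at least {MIN_RECORDS} are required")
    if n_variables < MIN_VARIABLES:
        warnings.append(f"only {n_variables} variables; at least {MIN_VARIABLES} are expected")
    if n_variables > MAX_VARIABLES:
        warnings.append(f"{n_variables} variables; more than {MAX_VARIABLES} may be unwieldy")
    return warnings


# Made-up example: 25 records with 12 variables each.
example = [{f"var{j}": i * j for j in range(12)} for i in range(25)]
print(check_project_size(example))  # prints []
```

An empty list means the data set falls inside the limits; each warning names the guideline that was missed.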
Planning research is an extremely important first step. The requirement that the project have a central idea, or goal, is necessary to ensure that the project has some direction and to limit the scope of the project. If left unchecked it is easy for projects to "blow up" and become so involved in side issues that the initial reason for the project is lost. A central idea holds a project together and reduces the possibility that a researcher will amass a large amount of data hoping that the data will mean something. The "hit and miss" researchers rarely "hit" and often end up wishing that more time and thought had been put into plans. When I do statistical consulting for a thesis or dissertation my first request is "tell me in fifty words or less what your dissertation is about." This should be the first sentence in your project.
For some individuals, selecting a topic is the simplest part of the project; for others it can be exceedingly difficult. The topic may be chosen by circumstances such as a supervisor who needs particular information. For instance, if you were working for a radio station and needed to collect information about your listening audience to satisfy FCC regulations or if you were an army captain and wished to preselect enlistees who are likely to pass basic training, you might already know your topic. However, for those who have no idea for a topic, the following three rules for topic selection might be of value.
One common mistake made on statistics projects is choosing a problem that is too complex at the beginning. In most cases projects become more complex as they progress. If a complex topic is chosen initially, the project soon goes beyond any hope of interpretation in the allotted time. This is not to say that the statistics project must be trivial. Simplifying too much may result in a completed project that shows nothing. This error, however, is made much less frequently than the complexity error. I always tell students that they can continue working on a complex project after the course is complete, but they need to finish it by the end of the semester. One of the main things I look for in a project proposal is whether it is set at the correct level.
In most projects, beginning and otherwise, there are many emotional ups and downs associated with completion. In order to keep going when enthusiasm wanes, the project must retain some intrinsic interest to the researcher. If initial interest is low, the project may come to a complete halt when the problems associated with the research become apparent to the researcher.
A second advantage of an interesting topic is that something is generally known about it before beginning. If the researcher selects a topic with which he or she is familiar and about which he or she wishes to become more knowledgeable, the project becomes an educational experience rather than simply an exercise in learning how to do statistics. If a topic is interesting to the researcher it will probably also be so for the reader.
Some topics which are of initial interest must necessarily be rejected because it would be either too difficult or too expensive to find objects or people to measure. Expensive is used in this context to mean not only money, but time and effort. On the other hand, if the objects or people are available, it may be difficult or impossible to adequately measure the desired attributes. Sometimes the decision becomes whether to expend the effort needed to plan a study when it might not work out, or to select a secondary topic of less interest, but which appears to be much easier to realize. The ability to know when to face problems and when to drop a topic is perhaps one attribute which distinguishes a good researcher from a poor one. This is another thing I look for in a project proposal. In order to give feedback, I need a fairly detailed description of what your data will look like. Since this information will also be part of the final project, it is well worth spending time on it for the proposal.
Whereas data was once a relatively scarce commodity, with the Internet of Things data is now everywhere. For example, the apparatus I wear on my wrist collects data on heart rate, sleep, and steps I take. Combined with daily entries on food and calorie intake, weight, and types of exercise, it produces more readily available data than I could ever use for a project. Another personal example is my ancestry tree on Ancestry.com. I have data on many of my ancestors that includes gender, dates of birth, marriage, and death, number of siblings, number of children, whether the ancestor is on my mother's or father's side of the family, residence, and number of generations removed from me. It could result in an interesting project, at least to me.
Since many of my students are employed in the medical profession, often data is readily available that can provide information to allow more informed decisions on the job. For example, emergency rooms collect data on the time it takes to obtain service, type of emergency, shift, how the patient arrived, type of insurance, etc. Often this type of data, if sanitized by removing any personal information, can be used for both a statistics project and a later report presented to the place of employment. This dual use is encouraged, although it is always a good idea to let supervisors know and approve of what you are doing before you begin. Even better is when supervisors are involved in the planning process and have a stake in the outcome.
Some examples of projects which have worked in the past for student researchers are:
These examples are not meant to limit the types of topics which may be explored, but show some possibilities.
After the central theme or topic has been selected, the next step in the research process is to determine the essential structure of the phenomena of interest. This decision is necessarily based on the beliefs of the researcher about the structure of the world. Research proceeds not by randomly selecting attributes from a list and hoping to discover important relationships, but by systematically exploring critical relationships. Theory is defined as "a set of interrelated constructs (concepts), definitions, and propositions that presents a systematic view of phenomena by specifying relations among variables, with the purpose of explaining and predicting the phenomena" (Kerlinger, 1964, p. 11). All research proceeds from theory whether stated or not, because certain attributes are examined and others are not.
Much of science is concerned with formalizing theory and generating testable hypotheses about the world. If the hypotheses are confirmed through research, support is garnered for the theory. If not, the theory must be discarded or, as is more often the case, changed. The theory may be disconfirmed but never "proven." That is, it is always possible to generate similar hypotheses from a different theory. A theory, however, may be more or less useful depending on its complexity (the simpler the better) and its explanatory power (the more the better).
At the student project level it is not necessary to be overly concerned with derivation of hypotheses from theory. Some reflection, however, may lead to insight about the nature of your own beliefs and selection of more interesting attributes to measure. The variables selected for inclusion in the project should be measures of the important attributes in the theory. While not the main focus of the project, the introduction is an important component of a good project. Starting with the central theme, the introduction should lead the reader to understand why you included your selected variables in your project. I'm not looking for a thorough review of all the relevant literature, but at least give the reader some sense of direction of why you selected the variables in your project.
Another aspect of theory is the generation of hypotheses about what you expect to find with your analysis. Even without a formal explicit theory, generation of hypotheses can sometimes expose an implicit theory. In any case, to be fair, the hypotheses must be stated before the data analysis. This is another thing I look for in a project proposal.
Although much has been written about how to construct attitude scales and techniques of psychological measurement (Torgerson, 1967), the following are presented as general guidelines to follow in measuring the attributes you have selected for your project.
I like the finer things in life.
Strongly Disagree Disagree No Opinion Agree Strongly Agree
a more discriminating question might be:
I would rather buy fine clothes than books.
Strongly Disagree Disagree No Opinion Agree Strongly Agree
How much do you like to do things by yourself?
very much often sometimes never
or you could add up the responses to the following three questions:
I like to do things by myself.
very much often sometimes never
I would like to live on a commune.
very much often sometimes never
I feel people should rely on themselves rather than expect the government to provide them a living.
Strongly Disagree Disagree No Opinion Agree Strongly Agree
Taxes are too high. Yes No
would probably elicit very few "no" answers. Even if the scale were broken down further into "strongly agree, agree, etc.", one would expect very few responses below "agree."
For example:
I like wide tires on my car. SD D NO A SA
would provide different information than actually measuring the width of the tires on a person's car. Depending upon the attribute being measured, either of the above may be appropriate, but the latter is more objective.
An example questionnaire, dealing with the role of the student in the university, is presented below. Both the questions dealing with student roles and student control will be treated as interval data in later analysis.
Role of the Student Survey
Recalling one of the basic functions of statistics, that of inferring from a sample to a population, the researcher is next faced with the dual problems of defining the population of interest and sampling from that population.
Defining the population of interest is generally fairly straightforward after the central theme is specified. By asking the question, "Who or what do I want to know the information about?" the range of possible populations is specified. A researcher may define the population more or less broadly depending on the kind of sample that is available and the amount of risk he or she is willing to take. An inference from a sample to a population is better or worse depending upon 1) the size of the sample and 2) the representativeness of the sample.
As discussed in the chapter on the sampling distribution, the larger the sample, the more confidence one has in inferring from the sample to the population. In addition, more detailed analysis may be performed when the sample is larger. In general, then, the larger the sample the better. Practical considerations such as time, expense, and availability are limiting factors in determining the size of the sample for a project.
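The effect of sample size can be seen in a small simulation. The sketch below draws samples of increasing size from a made-up population and prints the standard error of the mean, computed as the sample standard deviation divided by the square root of the sample size. The population values, the seed, and the sample sizes are all assumptions chosen only for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch can be repeated

# Made-up population: 10,000 scores centered near 50 with a spread near 10.
population = [random.gauss(50, 10) for _ in range(10_000)]


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Larger samples give smaller standard errors, so the sample mean is a
# more trustworthy estimate of the population mean.
for n in (20, 40, 100, 400):
    sample = random.sample(population, n)
    print(n, round(statistics.mean(sample), 2), round(standard_error(sample), 2))
```

Quadrupling the sample size roughly halves the standard error, which is one reason 30-40 records support more reasonable analysis than the bare minimum of twenty.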
An assumption of a random sample provides the theoretical underpinnings for most of the inferential statistics presented in this text. In practice, however, a truly random sample is rarely practical and seldom accomplished. In psychology, a representative sample, such as students who volunteer and show up for a psychological experiment, is in many cases considered acceptable for statistical purposes. The student doing a project may have difficulty obtaining even a representative sample and, therefore, a random or representative sample is not required for the project. Requesting on Facebook that your friends complete a survey you created would result in a rather strange population. Because not everyone would respond to your request, the resulting sample would be even stranger. The point is that while such a sampling method is acceptable for your project, you need to be clear about what you did. This has large implications when you discuss the meaning of your results in the discussion section. A fairly detailed description of how you are going to collect your data should be part of your project proposal.
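In the uncommon case where a complete list of the population is available, a simple random sample is not hard to draw. The sketch below assumes such a list exists; the list of 200 student identifiers, the seed, and the sample size of 30 are made up for illustration.

```python
import random

# Hypothetical sampling frame: a complete list of the population of interest.
sampling_frame = [f"student_{i:03d}" for i in range(1, 201)]

# A seeded generator lets the selection be reproduced and described in the report.
rng = random.Random(2024)

# Draw 30 members without replacement; each member has the same chance of selection.
sample = rng.sample(sampling_frame, k=30)

print(len(sample), len(set(sample)))  # prints 30 30
```

Recording the seed and the source of the list is one way of being clear about what you did when describing the sampling method.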
The non-representativeness of the sample is a possible criticism of any inference from a sample to population. If a sample is not representative it will be necessary for the researcher to consider the possible effects of this upon the results. One possibility is to redefine the population such that the sample may be considered representative. Another alternative is to consider the possible effects and present them in the discussion section.
The definition of a population and selection of a sample from this population will give the researcher some practical application of these concepts. Procedures in practice are often found to be somewhat different than the theory presented in introductory statistics texts.
Before beginning the actual collection of the data it will be necessary to obtain approval from individuals or committees responsible for protecting human subjects. In many cases a proposal must be submitted and approved. In preparing the proposal the following questions can be used as guidelines for the data collection procedure.
If so, what kind of procedure is available for protection against this danger?
Before participating, the subject must be given information about the nature of the research and informed that participation is voluntary. The instructions or cover sheet must also inform the subjects that they may quit at any time during the procedure. An example cover sheet is given below.
A signed statement to this effect is sometimes obtained for protection of the researcher. The signing of a statement, however, may jeopardize procedures to ensure anonymity. A statement to the effect that by filling out the questionnaire the subject is giving permission to use the data has found approval with the local committees.
This is a controversial topic among researchers. On one hand the instructions on the cover sheet must give a reasonably clear and descriptive account of the data collection procedure to allow for informed consent. On the other hand, the results can possibly be biased by giving the participants information about the hypotheses under study. In some cases mild deception is valuable and acceptable in that it may disguise the true purpose of the study. Any deception beyond this will probably be unacceptable and should be avoided.
In general, names of participants are not recorded. If names are necessary for possible comparisons with data collected at a later date, they should be kept locked and physically separated from the data collection instruments. Folding completed questionnaires and then placing them in a box or envelope rather than handing them to the researcher also maintains a greater degree of anonymity.
The procedure for collecting the data can markedly affect the results. For example, a researcher studying extramarital affairs might get quite different results depending upon whether he or she conducted a direct interview in the presence of the spouse, in the presence of peers, or alone. A confidential questionnaire might produce still different results.
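For data kept electronically, the same separation of names from responses can be sketched as follows. This is an illustration under assumptions, not a required procedure: the function name `separate_identities`, the field names, and the use of random codes from the standard `secrets` module are choices made here for the example.

```python
import secrets


def separate_identities(records, name_field="name"):
    """Split records into anonymous data and a separate code-to-name key."""
    key = {}    # participant code -> name; store locked and apart from the data
    data = []   # responses that carry only the participant code
    for record in records:
        code = secrets.token_hex(4)
        while code in key:              # very unlikely, but keep codes unique
            code = secrets.token_hex(4)
        key[code] = record[name_field]
        # Copy every field except the name, then attach the code.
        anonymous = {k: v for k, v in record.items() if k != name_field}
        anonymous["participant"] = code
        data.append(anonymous)
    return data, key


# Made-up responses from two participants.
raw = [
    {"name": "Pat Example", "q1": 4, "q2": 2},
    {"name": "Lee Sample", "q1": 3, "q2": 5},
]
data, key = separate_identities(raw)
print(data)  # the analysis file contains codes, not names
```

Only the key links a code back to a person, so it is the one piece that needs to be locked away; the coded data can be analyzed and shared without names.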
The actual collection of the data is perhaps the simplest part of the procedure, yet unforeseen problems may arise. The procedure may prove embarrassing to the people being questioned, they might not answer accurately, or the data might prove too costly to collect in terms of time or money. In any case, a pilot instrument is strongly recommended to anticipate and avoid such problems. The easiest way to pilot test a data collection instrument is to give it to a friend who is willing to fill it out in your presence and identify any question which is unclear.
The actual data collection may be in the form of a questionnaire, interview, or naturalistic observation. Each of the preceding methods has advantages and disadvantages. An interview may produce more relevant, detailed, and specific information than a questionnaire if the interviewer is skilled and enough time is taken. A questionnaire may produce the desired information with much less trouble and in a more standardized fashion. The choice of method depends largely upon the topic.
Using appropriate existing data, whether found through a web search or already collected and recorded by others, is acceptable as long as it has not been analyzed and reported in the manner you intend. A description of the collection method should be part of both your proposal and final project. If using a web resource, links are necessary.
Any method of data collection is acceptable for the project; the only requirement is that the procedure is ethically sound, clearly described, and the possible effects of the data collection method are considered when interpreting the results of the study.
Proposal - Subjects | Too many subjects will make the project difficult to complete in a semester. | Correct number of subjects for the project. | Too few subjects for an acceptable project. | The number of subjects was not given in the proposal. |
Proposal - Procedure | The procedure is clearly described and acceptable | The procedure is clearly described and potential issues are present | The procedure is partially spelled out but more information is needed. | The procedure is inadequately described. |
Proposal - Survey Instrument | The survey instrument, if needed, is shown as an Appendix. | The survey instrument is described, but not shown. | This survey instrument is neither shown nor described. | A survey instrument is not needed for this project. |
Proposal - Variables | The proposal describes between 10 and 20 variables. | The proposal has more than 20 variables | The proposal has fewer than 10 variables | It is impossible to determine how many variables this project will have. |
Proposal - Hypotheses | The hypotheses are clearly stated and testable. | The hypotheses are reasonable. | The hypotheses are questionable. | No hypotheses are discussed. |
Proposal - Quality | A quality proposal - please proceed | Some minor issues with the proposal - see text. | Some major issues with the proposal - see text. | Please resubmit a project proposal - see text. |
Central Theme | was clearly and succinctly stated and included in the first paragraph | was presented | was implied, but not clearly stated | ??? |