Appendix A
Experimental vs.
Quasi-Experimental Designs
Many social welfare
programs look successful--to their own staffs as well as
to outsiders--because their clients seem to be doing so
well. For example, a substantial proportion of trainees
may have found jobs after having gone through a
particular program. The question is: Did they get their
jobs because of the program, or would they have done so
anyway? Answering this question is the central task in
evaluating the impact of a program or policy. In other
words, what would have happened to the clients if they
had not been in the program or subject to the policy?
The key task of an
impact evaluation is to isolate and measure the program
or policy's effects independent of other factors that
might be at work, such as local economic conditions, the
characteristics of participants, and the quality of the
particular project's leadership. To do so, researchers
try to establish the "counterfactual"; that is,
they try to see what happened to a similar group that was
not subject to the program or policy.
Researchers use
either experimental or quasi-experimental designs to
establish the counterfactual. After describing both
approaches, this appendix summarizes their principal
strengths and limitations, with illustrations from recent
studies.
Experimental
Designs
Many social
scientists believe that experimental designs are the best
way to measure a program or policy's impact. In an
experimental design, individuals, families, or other
units of analysis are randomly assigned to either a
treatment or control group. The treatment group is
subjected to the new program or policy, and the control
group is not. The experience of the control group, thus,
is meant to represent what would have happened but for
the intervention.
If properly planned
and implemented, an experimental design should result in
treatment and control groups that have comparable
measurable and unmeasurable aggregate characteristics
(within the limits of chance variation). And, from the
moment of randomization, they will be exposed to the same
outside forces, such as economic conditions, social
environments, and other events--allowing any
subsequent differences in average outcomes to be
attributed to the intervention.
Thus, experimental
designs ordinarily do not require complex statistical
adjustments to eliminate differences between treatment
and control groups. Policymakers can then focus on the
implications of findings, rather than "become
entangled in a protracted and often inconclusive
scientific debate about whether the findings of a
particular study are statistically valid."1 As we will see, the same is
not true for quasi-experiments.
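The arithmetic of an experimental impact estimate is correspondingly simple. The sketch below is purely illustrative (hypothetical data and function names, not drawn from any study cited here): applicants are randomly split into treatment and control groups, and the impact is estimated as the difference in average outcomes.

```python
import random
import statistics

def random_assignment(applicant_ids, seed=1):
    """Randomly split a pool of applicants into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(applicant_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]   # treatment, control

def estimated_impact(outcomes, treatment_ids, control_ids):
    """Difference in mean outcomes; with random assignment, no further
    statistical adjustment is ordinarily needed (up to chance variation)."""
    treatment_mean = statistics.mean(outcomes[i] for i in treatment_ids)
    control_mean = statistics.mean(outcomes[i] for i in control_ids)
    return treatment_mean - control_mean

# Made-up example: outcome is 1 if the person found a job within a year.
applicants = list(range(8))
treatment, control = random_assignment(applicants)
outcomes = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0}
print("estimated impact:", estimated_impact(outcomes, treatment, control))
```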
In the last 30
years, experimental designs have been used to evaluate a
wide range of social interventions, including housing
allowances, health insurance reforms, the negative income
tax, and employment and training programs. 2 The evaluations of
welfare-to-work programs conducted by Manpower
Demonstration Research Corporation (MDRC) in the
1980s--which used experimental designs--are
widely credited with having shaped the Family Support Act
of 1988. 3 Similarly, in the 1990s, Abt
Associates evaluated the Job Training Partnership Act
(JTPA) program. 4 Its findings, also based on an
experimental design, likewise led to major policy
changes.
Experimental designs
are not without disadvantages, however. They can raise
substantial ethical issues, can be difficult to implement
properly, and cannot be used for certain types of
interventions. 5
Ethical issues
arise, for example, when the treatment group is subjected
to an intervention that may make its members worse off or
when the control group is denied services that may be
beneficial. 6 In the late 1980s, the state of Texas
implemented a random assignment evaluation to test the
impact of 12-month transitional child care and Medicaid
benefits. When the study began, the treatment group was
receiving a benefit (the transitional services) that was
otherwise unavailable. Hence, denying the same benefit to
the control group did not raise an ethical issue. But a
year later, nearly identical transition benefits became
mandatory under the Family Support Act. At that point,
the control group was being denied what had become part
of the national, legally guaranteed benefit package. In
the face of complaints, the secretary of Health and Human
Services required the control group to receive the
benefits, thereby undercutting the experiment.
Sometimes, members
of the program staff object to the denial of services
built into the experimental design. When they view the
experiment as unethical or fear that members of the
control group will complain, they sometimes circumvent
the procedures of the random assignment process, thus
undermining the comparability of the treatment and
control groups. This apparently happened, for example, in
an evaluation of the Special Supplemental Nutrition
Program for Women, Infants, and Children (WIC). 7
Implementation
issues arise in every study. 8 As Rossi and Freeman state,
"the integrity of random assignment is easily
threatened." 9 For example,
"contamination," where control group members
are also subjected to all or some of the intervention,
was a problem in many of the waiver-based state welfare
reform demonstrations. In many states, the new policies
were applicable to all recipients in the state, except
for a small control group (usually drawn from a limited
number of sites). It was not uncommon for members of
these control cases to migrate to other counties and
receive the treatment elsewhere. Metcalf and Thornton
add:
Typical
forms of distortion or "contamination"
include corruption of the random assignment
mechanism, provision of the treatment
intervention to controls despite proper
randomization, other forms of
"compensatory" treatment of controls or
unintended changes in the control group
environment, and distortion of referral flows.10
Statistical
adjustments cannot always deal successfully with such
problems.
In some experiments,
members of either the control or treatment groups may not
fully understand the rules to which they are subject. For
example, in New Jersey's Family Development Program, it
appears that many members of the control group believed
that the family cap policy applied to them, probably
because of the extensive statewide publicity this
provision received.11 Because this may have
affected their behavior, it is unlikely that the impact
of the demonstration can be determined by comparing the
birth rates in the two groups.
In other cases,
treatment group members were not made aware of all the
changes that affected them. For example, in California,
caseworkers did not initially inform recipients of the
state's new work incentive provisions. The impact
findings suggest that the demonstration had little effect
on employment and earnings, but it is unclear whether
this is because the policy was ineffective or because the
recipients were unaware of its provisions.
Attrition from the
research sample and nonresponse are other problems for
experiments, although they can also afflict
quasi-experiments. First, the research sample may become
less representative of the original target population.
For example, in the Georgia Preschool Immunization
Project, the evaluator was only able to obtain permission
to examine the immunization records from about half of
the research sample.12 If the immunization records
for those not included are significantly different from
those for whom data are available, nonresponse bias could
be a problem. Second, if there is differential attrition
between the treatment and control groups, their
comparability is undermined and bias may be introduced.13 This would be especially
problematic if the characteristics of those for whom
immunization records are available differed
systematically in the treatment versus the control group.
For example, it may be that those in the treatment group
who were in compliance were more likely to open their
records to the evaluators--a reasonable assumption,
since those who are behind may hesitate for fear of being
sanctioned. There was some evidence in the experiment of
such differences, based on the characteristics of the
clients at the time of random assignment. As a result,
the evaluator had to adopt statistical techniques to
control for bias.
There is a
possibility that the welfare reform intervention, itself,
might affect attrition. For example, treatment cases that
lose benefits due to a time limit may move to another
state to regain assistance, but there would be no
corresponding incentive for control cases to do the same.
All of the foregoing
implementation problems, of course, apply to
quasi-experimental designs as well.
Some
interventions are not amenable to randomized experiments.
Experimental designs may not be appropriate for
interventions that have significant entry effects. For
example, a stringent work requirement may deter families
from applying for assistance. This effect may not be
captured in a random assignment evaluation, because it
occurs before the family is randomly assigned.14 Some reforms may have
community-wide effects.15 For example, they may change
the culture of the welfare office, leading caseworkers to
treat all clients--treatment and
control--differently. Again, this change would not be
captured by a simple treatment-control comparison of
outcomes.
Furthermore, the
random assignment process, itself, can affect the way the
program works and the benefits or services available to
control group members.16 For example, in the National
JTPA evaluation, extensive outreach was necessary because
the assignment of applicants to the control group left
unfilled slots in the program. The applicants brought
into the program were not the same as those who were
effectively "displaced" when assigned to the
control group. Thus, the impacts on those who were in the
program may not correspond to the impacts on those who
would have been in the program in the absence of the
demonstration.
Quasi-Experimental
Designs
When random
assignment is not possible or appropriate, researchers
often use quasi-experimental designs. In
quasi-experiments, the counterfactual is established by
selecting a "comparison" group whose members
are not subject to the intervention but are nevertheless
thought to be similar to the treatment group.
Participants vs.
Nonparticipants. Participants in the program are
compared to nonparticipants with similar characteristics
on the assumption that both groups are affected by the
same economic and social forces. But even though the two
groups may appear similar, they may differ in
unmeasurable, or difficult to measure, ways. For example,
those who voluntarily enroll in a training program may
have more desire to find a job than those who do not.
Alternatively, those who do not enroll may want to work
immediately or in other ways may be in less need of a
training program. Both are possibilities that should be
considered in interpreting a study's results. Statistical
and other methods are sometimes used to control for such
"selection effects," but success in doing so
has been mixed.
The Urban Institute
used this approach to evaluate the Massachusetts
Employment and Training (ET) Choices program.17 It compiled a longitudinal
database of information on about 17,000 AFDC recipients,
half of whom participated and half of whom did not participate beyond
initial registration and orientation. The nonparticipants
served as a comparison group, selected through a
statistical procedure that matched the comparison group
members to the participant group on several measurable
characteristics, including race, sex, age, and family
composition. Some characteristics, such as motivation,
could not be measured. Although the evaluators attempted
to control for "selection bias," the results
are still subject to uncertainty.
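The statistical matching step can be pictured with a toy sketch. The code below is a generic nearest-neighbor match on a few measured characteristics, offered only as an illustration of the idea; it is not the Urban Institute's actual procedure, and the variable names and numbers are invented. Note that no such match can balance unmeasured traits such as motivation.

```python
import math

def nearest_neighbor_match(participants, nonparticipants, covariates):
    """For each participant, find the most similar nonparticipant
    (matching with replacement) based only on measured covariates."""
    def distance(a, b):
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in covariates))
    return {
        pid: min(nonparticipants, key=lambda nid: distance(person, nonparticipants[nid]))
        for pid, person in participants.items()
    }

# Invented records, with covariates already standardized.
participants = {1: {"age": 0.2, "children": 1.0}, 2: {"age": -1.1, "children": 0.0}}
nonparticipants = {10: {"age": 0.3, "children": 1.0}, 11: {"age": -0.9, "children": 0.0}}
print(nearest_neighbor_match(participants, nonparticipants, ["age", "children"]))
```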
Comparison Sites.
Individuals from other geographic areas are compared to
those in the program. This avoids problems of comparing
those who volunteer for a program to those who do not
(selection effect), but creates other complications. In
particular, statistical adjustments are needed for
economic and demographic differences between the sites
that may influence participant outcomes. This method
works best when similar sites are matched, and when the
treatment and comparison sites are selected randomly. But
if sites can choose whether to receive the treatment or
serve as the comparison, selection bias can be a problem.
Also, sites that initially appear well matched may become
less so, for reasons unrelated to the intervention. (Some
events, such as a plant closing, can be especially
problematic.)
In one of the rare
exceptions to its requirement that waiver-based
demonstrations use an experimental design, HHS allowed Wisconsin to
evaluate its "Work Not Welfare" demonstration
using a comparison site approach. However, Wisconsin
selected as treatment sites the two counties that were
most interested in implementing the demonstration (and
perhaps most likely to succeed). Beyond this important
difference, the two counties also differed
from others in the state on a number of other dimensions
(for example, they had lower unemployment rates), further
complicating the analysis. One review of the evaluation
plan concludes:
It is
unlikely, however, that matched comparison
counties and statistical models will adequately
control for the fact that the demonstration
counties were preselected. It may not be possible
to separate the effects of the program from the
effects of being in a county where program staff
and administrators were highly motivated to put
clients to work.18
Pre-Post
Comparisons. Cohorts of similar individuals from
different time periods are compared, one representing the
"pre" period and one the "post"
period. This also requires statistically controlling for
differences between the groups. Using this approach,
several studies examined the impact of the AFDC reforms
in the 1981 Omnibus Budget Reconciliation Act (OBRA).19 One problem with pre-post
evaluations is that external factors, such as changing
economic conditions, may affect the variable of interest,
so that the trend established before the new intervention
is not as good a predictor of what would have otherwise
happened. The evaluation of the 1981 OBRA changes, for
instance, had to control for the 1981-1982 recession,
which was the worst in 45 years. In fact, there are
likely to be many changes and it is difficult to
disentangle the impact of a reform initiative from
changes that may occur in the economy or in other public
policies. For example, studies using this methodology to
examine welfare reform in the 1990s would have to control
for expansion in the Earned Income Tax Credit (EITC) and
the increase in the minimum wage, two important policy
changes that could affect labor market outcomes.
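A stylized numeric example (invented numbers, not the OBRA results) shows why such controls matter: the raw difference between the pre and post cohorts mixes the reform's effect with the recession, while a regression that includes the unemployment rate nets some of that out.

```python
import numpy as np

# Rows: one per cohort-year.  Columns: intercept, post-reform indicator,
# local unemployment rate.  All values are invented for illustration.
X = np.array([
    [1.0, 0.0, 7.1],
    [1.0, 0.0, 7.6],
    [1.0, 1.0, 9.5],   # post period coincides with a recession
    [1.0, 1.0, 9.9],
])
y = np.array([52.0, 51.0, 47.0, 46.0])   # e.g., percent of the cohort employed

raw_difference = y[X[:, 1] == 1].mean() - y[X[:, 1] == 0].mean()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares

print("raw pre-post difference:        ", round(raw_difference, 2))
print("difference net of unemployment: ", round(float(coef[1]), 2))
```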
Since there may be
no more than a few years of data on the "pre"
period, the length of follow-up for the "post"
period is limited as well. This may be too short a time
to test the long-term impact of important policy changes,
especially since some changes, such as time limits, may
not take full effect for many years. It also
may not be possible to obtain data on some outcomes for
the "pre" period, such as measures related to
child well-being, particularly if they are not readily
available on administrative records. In addition,
detailed data on participant characteristics, economic
conditions, and other relevant "control"
variables are needed. For example, New Jersey's first
welfare reform demonstration, the Realizing Economic
Achievement (REACH) program, compared the outcomes of a
cohort of recipients subject to REACH to a cohort of
similar individuals from an earlier period.
Unfortunately, the evaluator concluded that the
limitations of the historical data were so severe that it
was not possible to draw any definitive conclusions from
the results.20
Another way of
conducting a pre-post comparison is to examine those
participating in the program before and after going
through it. The outcomes for the group in the pre-program
period serve as the comparison "group" for the
same population after the program is implemented. (For
example, the employment and earnings of individuals can
be compared before and after participation in a training
program.)
A major advantage of
this design is that it requires data only on program
participants. Unfortunately, as Rossi and Freeman note:
Although few
designs have as much intuitive appeal as simple
before-and-after studies, they are among the
least valid assessment approaches. The essential
feature of this approach is a comparison of the
same targets at two points in time, separated by
a period of participation in a program. The
differences between the two measurements are
taken as an estimate of the net effects of the
intervention. The main deficiency of such designs
is that ordinarily they cannot disentangle the
effects of extraneous factors from the effects of
the intervention. Consequently, estimates of the
intervention's net effects are dubious at best.21
Comparisons with
Secondary Data Sets. Secondary data sets, such as
the Census Bureau's Survey of Income and Program
Participation (SIPP), the Current Population Survey
(CPS), or other national or state-level data sources,
have also been used to develop comparison groups. In such
comparisons, a sample of similar persons is identified to
represent what would have happened in the absence of the
intervention. Many evaluations of the Comprehensive
Employment and Training Act (CETA) employed this
approach.22 As with other quasi-experimental
methods, selection bias is a problem, because volunteers
for the program are compared to nonparticipants.
Moreover, complications can arise because the data for
comparison group members derived from such secondary
sources are generally cruder than for the treatment
group.
Time
Series/Cross-Sectional Studies. Time series and
cross-sectional analyses use aggregate data to compare
outcomes either over time or across states (or other
political subdivisions), attempting to control
statistically for variables that can affect the outcome of
interest and including a variable that represents the
intervention itself. These methods have been commonly used by
researchers, but are very sensitive to the specification
of the model.23
For example, one
evaluation of the Massachusetts ET program used time
series analysis.24 A host of explanatory variables were
used to reflect the importance of demographic, economic,
and policy factors that would be expected to have an
impact on the caseload, including a variable to measure
the impact of the program being evaluated. The study
found that the ET program did not lead to any significant
reduction in the welfare rolls in Massachusetts, but the
author cautioned:
Analysis of
time series data is often complicated by the fact
that many variables tend to change over time in
similar ways. For this reason, it may be
difficult to separate out accurately the impact
of the different factors. Thus, the estimated
effects of the explanatory variables may be
unstable, changing from one specification of the
model to others.25
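The instability described in this caution can be reproduced with a small simulation. In the invented data below, the true program effect is zero and the caseload falls only because unemployment falls after the program begins; the estimated "program effect" then depends heavily on whether unemployment is included in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(24)
program = (years >= 12).astype(float)        # program begins in year 12
# Unemployment is flat before the program and declines afterward.
unemployment = np.where(program == 0, 7.5, 7.5 - 0.3 * (years - 11)) + rng.normal(0, 0.1, 24)
# True model: caseload depends on unemployment only (zero program effect).
caseload = 100 + 4.0 * unemployment + rng.normal(0, 0.5, 24)

def program_effect(include_unemployment):
    """Coefficient on the program indicator under a given specification."""
    cols = [np.ones(24), program] + ([unemployment] if include_unemployment else [])
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), caseload, rcond=None)
    return round(float(coef[1]), 2)

print("program effect, unemployment included:", program_effect(True))    # near the true zero
print("program effect, unemployment omitted: ", program_effect(False))   # decline misattributed to program
```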
As is evident from
the above discussion, a major problem with
quasi-experimental designs is selection bias. This arises
out of processes that influence whether persons are or
are not program participants. Unmeasured differences in
personal characteristics, such as the degree of
motivation, rather than the program itself, could explain
differential outcomes. Sometimes the selection processes
are system characteristics, such as differences among
welfare offices, which lead some to participate in reform
efforts and others not to. Although there are a variety
of statistical techniques to correct for selection bias,
it is impossible to know with certainty which is most
appropriate. And, since these methods result in different
estimates, there is always some uncertainty regarding the
findings of quasi-experiments. Here is how Gary Burtless
of the Brookings Institution put it:
Our
uncertainty about the presence, direction, and
potential size of selection bias makes it
difficult for social scientists to agree on the
reliability of estimates drawn from
nonexperimental studies. The estimates may be
suggestive, and they may even be helpful when
estimates from many competing studies all point
in the same direction. But if statisticians
obtain widely differing estimates or if the
available estimates are the subject of strong
methodological criticism, policymakers will be
left uncertain about the effectiveness of the
program.26
With experimental
designs, such adjustments are unnecessary, since random
assignment should equalize the treatment and control
groups in terms of both observable and unobservable
characteristics.
***
Experimental designs
have long been the evaluation method of choice, and
should probably be considered first in any evaluation.
Many current welfare reform efforts, however, are not
amenable to randomized experiments. The new program or
policy may cover the entire state, without provision
having been made for a control group; the changes made by
the state may have affected norms and expectations across
the entire community, sample, or agency, so that the
control group's behavior was also influenced; and there
may be substantial "entry effects," as
described above.
Thus, in many
circumstances, a quasi-experimental design will be the
preferable approach. Although not nearly as problem-free
as experimental designs, quasi-experiments can provide important
information about new policies and programs.
The overriding point
is that welfare reform efforts should be evaluated as
well as possible, and the design chosen should be the one
most likely to succeed.
© 1997 by the University of Maryland,
College Park, Maryland. All rights reserved. No part
of this publication may be used or reproduced in any manner
whatsoever without permission in writing from the University of
Maryland except in cases of brief quotations embodied in news
articles, critical articles, or reviews. The views
expressed in the publications of the University of Maryland are
those of the authors and do not necessarily reflect the views of
the staff, advisory panels, officers, or trustees of the
University of Maryland.