Evaluating the Evaluations
As the foregoing
discussion of studies suggests, the next few years will
witness a veritable flood of new evaluation reports. The
total body of research will be large, complex, and likely
to lead to diverse and contradictory findings.
The Need
Many of the
evaluations will provide important information about the
impact of the new welfare regime on individuals and
institutions. They will identify the difficulties and
successes that states have had in implementing their
reforms, and estimate the impacts of such reforms on the
well-being of the poor, especially on their children.
These findings, in turn, can help policymakers choose
between various program approaches. For example, after
MDRC documented the apparent success of "labor force
attachment strategies" in reducing welfare
caseloads, many states adopted them.
However, many of the
evaluations will have such serious flaws that their
utility will be sharply limited. For example, because of
design and implementation problems, no one may ever know
whether New Jersey's "family cap" had any
impact on the birth rates of mothers on welfare.
(Recently, two outside experts reviewed the evaluation of
New Jersey's Family Development Program, which included a
family cap provision. They concluded that there were
serious methodological flaws in the evaluation, so an
interim report was not released.)
Evaluations can go
wrong in many ways. Some have such obvious faults that
almost anyone can detect them. Other flaws can be
detected only by experts with long experience and high
levels of judgment.
The "100-hour
rule" demonstrations are an example of the need for
the expert review of evaluations. The AFDC-UP Program
(abolished by TANF) provided benefits to two-parent
families if the principal earner had a significant work
history and worked less than 100 hours per month. Because
this latter requirement, the so-called "100-hour
rule," was thought to create a disincentive for
full-time employment, the FSA authorized a set of
experiments to alter the rule. Three states (California,
Utah, and Wisconsin) initiated demonstrations to evaluate
the impact of softening the rule.
Findings from these
evaluations suggest that eliminating the rule for current
recipients had little impact on welfare receipt,
employment, or earnings. But in a recent review, Birnbaum
and Wiseman identified many flaws in these studies.1 First, random assignment
procedures were undermined in all three states, so the
treatment and control groups were not truly comparable.
Second, the states did a poor job of explaining the
policy change to the treatment group, limiting its impact
on client behavior. Third, some outcomes, such as those
related to family structure, were poorly measured.
The proper use of
these forthcoming evaluations requires the ability to
distinguish relevant and valid findings from those that
are not. This does not mean that studies must be perfect
in order to be useful. Research projects entirely without
flaws do not exist and, arguably, never will.
Almost every
evaluation is compromised by programmatic, funding, time,
or political constraints. No program has been implemented
with absolute fidelity to the original design. No
sampling plan has ever been without faults. Some
observations and data are missing from every data set.
Analytical procedures are always misspecified to some
degree. In other words, evaluation findings are only more
or less credible, and even poorly designed and
executed evaluations can contain some information worth
noting.
Devolution has
further increased the need for careful, outside reviews
of research findings. Previously, the federal government
required a rigorous evaluation in exchange for granting
state waivers, and federal oversight of the evaluations
provided some quality control. In keeping with the new
welfare law's block-grant approach, the federal
government's supervision of the evaluations of
state-based welfare initiatives will be curtailed: States
are no longer required to evaluate their reforms and, if
they do, they can choose any methodology they wish.
Already, there are
indications that state discretion under TANF will lead to
a proliferation of evaluation designs, some rigorous but
many not. As Galster observes, "Many state agencies
either lack policy evaluation and research divisions
altogether, or use standards for program evaluation that
are not comparable to those set by their federal
counterparts. The quantity and quality of many
state-initiated evaluations of state-sponsored programs
may thus prove problematic."2
The number of
studies purporting to evaluate welfare reform will grow
rapidly in the years to come. The challenge facing
policymakers and practitioners will be to sort through
the many studies and identify those that are credible. It
is a task that will be complicated by the volume and
complexity of the studies, and the highly charged
political atmosphere that surrounds them.
Tension is already
building between the conservative lawmakers responsible
for crafting the welfare bill and the predominantly
liberal scholars involved in monitoring and evaluating
it. Many of the researchers now studying the effects of
the welfare law were also vocal critics of it. For
example, the Urban Institute's $50 million project to
assess the "unfolding decentralization of social
programs" is being conducted by the same
organization whose researchers, in a highly controversial
study, claimed that the new law would push 2.6 million
people, including 1.1 million children, into poverty.3
This has caused some
conservatives in Congress to worry that
"pseudo-academic research" will unfairly
portray the effects of the welfare overhaul.4 Undoubtedly, some on the
left as well as the right will misuse or oversimplify
research findings to their own advantage, but even the
perception of bias can limit the policy relevance of
research. Good research should be identified, regardless
of the ideological stripes of its authors.
Review Criteria
The key issue is the
extent to which a discerned fault reduces the credibility
of a study. Unfortunately, most policymakers and
practitioners are ill-equipped to judge which faults are
fatal, especially since they often must act before the
traditional scholarly process can filter out invalid
results. This is understandable, since assessing
evaluation studies often requires both detailed knowledge
of the programs involved and a high level of technical
expertise.
To help them better
assess this research and glean the lessons it offers,
this paper also describes and explains the generally
accepted criteria for judging evaluations. The criteria,
of course, are not equally applicable to all evaluations.
Program
"Theory". Underlying every program's
design is some theory or model of how the program is
conceived to work and how it matches the condition it is
intended to ameliorate. An evaluation of the program
should describe the underlying social problem it is
intended to address and how the causal processes
described in the model are expected to achieve program
goals. Hence, a critical issue in assessing evaluations
is the adequacy of program models.
Special problems are
presented by reforms that have several goals. Many of the
waiver-based experiments are intended to achieve diverse
objectives, such as increasing work effort and promoting
stable families, and, thus, involve multiple
interventions. Sometimes the processes can work at cross
purposes, placing conflicting incentives on clients. For
example, many states have simultaneously expanded
earnings disregards and imposed strict time limits. As a
result, families that go to work may be able to retain a
modest cash grant as a result of the liberalized
treatment of earnings, but if they want to conserve their
time-limited benefits, they may choose not to take
advantage of this incentive. Examination of program
theory can reveal such conflicts and identify potential
unwanted side effects.
In assessing the
adequacy of an evaluation's program theory, questions
such as the following should be raised:
- Is there an
adequate description of the underlying social
problem the intervention is meant to address?
- Does the
intervention make sense in light of existing
social science theory and previous evaluations of
similar interventions?
- Are the
hypothesized causal processes by which the reform
effort is intended to achieve its goals clearly
stated?
- Have potential
unwanted side effects been identified?
Research
Design. An evaluation's research design is
crucial to its ability to answer, in credible ways,
substantive questions about program effectiveness. There
are two central issues in research design: (1) "internal
validity," or the ability to rule out
alternative interpretations of research findings; and (2)
"external validity," or the ability to
support generalizations from findings to larger
populations of interest.
For example, an
evaluation that is based solely on measures of client
employment levels taken before and after a reform is
instituted lacks strong internal validity because any
observed changes in employment levels cannot be uniquely
attributed to the reform measures. Similarly, an
implementation study of one welfare office in a state
system with scores of such offices is of limited external
validity because the office studied may not fairly
represent all the others.
The effectiveness of
a program is measured by comparing what happens when a
program is in place to what happens without the program,
the "counterfactual." A critical issue is how
the evaluation is designed to estimate this difference.
In this respect,
randomized experimental designs are considered to be
superior to other designs. (Experimental and
quasi-experimental designs are discussed in Appendix A.)
In a randomized experiment, individuals or families (or
other units of analysis) are randomly assigned to either
a treatment group to whom the program is given or a
control group from whom the program is withheld. If
properly conducted, random assignment should result in
two groups that, initially, are statistically comparable
to one another. Thus, any differences in outcomes between
the groups can be attributed to the effects of the
intervention with a known degree of statistical
precision. Random assignment rules out other possible
influences, except for the intervention itself, and
therefore has strong internal validity.
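To make that logic concrete, the sketch below (not drawn from any of the evaluations discussed here) simulates random assignment with invented earnings data and computes the impact estimate as the treatment-control difference in means, together with a conventional standard error. All numbers and variable names are hypothetical.

```python
# A minimal sketch (hypothetical data): random assignment and an
# experimental impact estimate computed as a difference in means.
import random
import statistics


def simulate_experiment(n=2000, true_impact=500.0, seed=42):
    """Randomly assign n families to treatment or control and return
    simulated annual earnings for each group (invented numbers)."""
    rng = random.Random(seed)
    treatment, control = [], []
    for _ in range(n):
        baseline = rng.gauss(mu=6000, sigma=2500)  # earnings without the program
        if rng.random() < 0.5:                     # random assignment
            treatment.append(baseline + true_impact)
        else:
            control.append(baseline)
    return treatment, control


def impact_estimate(treatment, control):
    """Treatment-control difference in means with a conventional standard error."""
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = (statistics.variance(treatment) / len(treatment)
          + statistics.variance(control) / len(control)) ** 0.5
    return diff, se


treatment, control = simulate_experiment()
diff, se = impact_estimate(treatment, control)
print(f"Estimated impact: ${diff:,.0f} (standard error ${se:,.0f})")
```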
Although random
assignment is usually the most desirable design, it is
not always feasible, especially when a program enrolls
all or most of its clientele. Quasi-experimental designs
are then employed. They rely on identifying a comparison
group with characteristics similar to those of the
treatment group, but from another geographic area or time
period or otherwise unexposed to the new policy. In some
cases, the outcomes of those subject to a new welfare
policy may be compared before and after exposure to the
new policy.
The major difficulty
with quasi-experimental designs is that the members of
comparison groups may differ in some unmeasured or
undetectable ways from those who have been exposed to the
particular program or intervention. Typically,
quasi-experimental designs employ statistical analyses to
control for such differences, but how well this is done
is open to debate. As a result, their internal validity
is not as strong as with randomized experiments. Judging
the strength of an evaluation design's internal validity
should be an issue at the center of any assessment.
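As a rough illustration of the statistical adjustment described above, the hypothetical sketch below regresses an invented outcome on a program indicator and one measured background characteristic; the program coefficient is the adjusted impact estimate. The data, variable names, and effect sizes are assumptions made for illustration, and any unmeasured differences between the groups would remain uncontrolled.

```python
# A minimal sketch (hypothetical data): regression adjustment for a
# quasi-experimental comparison. The program coefficient is an impact
# estimate "controlling for" the measured characteristic only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical background characteristic (e.g., prior-year earnings).
prior_earnings = rng.normal(5000, 2000, n)

# Comparison-group members come from a different area, so program exposure
# is related to prior earnings (the selection problem).
in_program = (prior_earnings + rng.normal(0, 2000, n) > 5000).astype(float)

# Outcome depends on prior earnings and on a true program impact of $800.
outcome = 3000 + 0.8 * prior_earnings + 800 * in_program + rng.normal(0, 1500, n)

# Naive contrast ignores the selection on prior earnings.
naive = outcome[in_program == 1].mean() - outcome[in_program == 0].mean()

# Regression adjustment: outcome ~ intercept + program + prior earnings.
X = np.column_stack([np.ones(n), in_program, prior_earnings])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"Naive difference in means:    ${naive:,.0f}")
print(f"Regression-adjusted estimate: ${coef[1]:,.0f}")
```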
External validity is
also crucial for policy purposes. Even an extremely
well-designed evaluation with high internal validity is
not useful to policymakers if its findings cannot be
extrapolated to the program's total clientele.
In large part, an
evaluation's external validity depends on how the
research population is selected. In many of the
waiver-based welfare reform demonstrations, the research
sites either volunteered to participate or were selected
based on criteria, such as caseload size and
administrative capacity, which did not make their
caseloads representative of the state's welfare
population as a whole. For example, in Florida, sites
were encouraged to volunteer in the Family Transition
Program. The two sites eventually selected were chosen
because they had extensive community involvement and
resources that could be committed to the project.5 In addition, random
assignment was phased in so as not to overload the
capacity of the new program to provide the promised
services. Thus, the findings are unlikely to be
representative of what would happen elsewhere in the
state (much less the nation), especially if implemented
on a large scale.
The evaluations of
the new welfare law will employ a variety of research
methods, including randomized experiments,
quasi-experimental and nonexperimental designs,
ethnographic studies, and implementation research. Each
has its own strengths and weaknesses. The method used
should be linked to the particular questions asked, the
shape of the program, and the available resources.
In assessing the
adequacy of an evaluation's research design, questions
such as the following should be asked:
- Are the impact
estimates unbiased (internal validity)? How was
bias (or potential bias) monitored and controlled
for? Were these techniques appropriate?
- Are the
findings generalizable to larger populations
(external validity)? If not, how does this limit
the usefulness of the findings?
Data
Collection. Allen once observed that
"adequate data collection can be the Achilles heel
of social experimentation."6 Indeed, many evaluations are
launched without ensuring that adequate data collection
and processing procedures are in place. According to
Fein, "Typical problems include delays in receiving
data, receiving data for the wrong sample or in the wrong
format, insufficient documentation of data structure and
contents, difficulties in identifying demonstration
participants, inconsistencies across databases, and
problems created when states convert from old to new
eligibility systems."7 Careful data collection is
essential for evaluation findings to be credible.
The data used to
evaluate the new welfare law will come from
administrative records and specially designed sample
surveys. In addition, some evaluations may involve the
administration of standardized tests, qualitative or
ethnographic observations, and other information
gathering approaches. Each of these has its own strengths
and limitations.
Because
administrative data are already collected for program
purposes, they are relatively inexpensive to use for
research purposes. For some variables, administrative
data may be more accurate than survey data, because they
are not subject to nonresponse and recall problems, as
surveys are.
Some administrative
data, however, may be inaccurate, particularly those that
are unnecessary for determining program eligibility or
benefit amounts. In addition, they may not be available
for some outcomes or may cover only part of the
population being studied. For example, information
related to family structure would only be available for
the subset of cases that are actually receiving
assistance.
The primary
advantage of surveys is that they enable researchers to
collect the data that are best suited for the analysis.
However, nonresponse and the inability (or unwillingness)
of respondents to report some outcomes accurately can
result in missing or inaccurate data. Moreover, surveys
can be expensive. Thus, many evaluations use several
different data sources.
Unfortunately,
evaluation designs are sometimes selected before
determining whether the requisite data are available. For
example, New Jersey's Realizing Economic Achievement
(REACH) program was evaluated by comparing participants'
outcomes with those of a cohort of similar individuals
from an earlier period, using state-level data. The evaluator concluded that
"shortcomings in the basic evaluation design . . .
and severe limitations in the scope and quality of the
data available for analysis, make it impossible to draw
any policy-relevant conclusions from the results."8
Although very few
social research efforts have achieved complete coverage
of all the subjects from which data are desired,
well-conducted research can achieve acceptably high
response rates. Several welfare reform demonstrations
have been plagued by low response rates, some as low as
30 percent. A high nonresponse rate to a survey or to
administrative data collection efforts can limit severely
the internal and external validity of the findings. Even
when response rates are high, all data collection efforts
end up with some missing or erroneous data; adequate data
collection minimizes missing observations and missing
information on observations made.
The new welfare law
significantly complicates data collection and analysis.
It will be more difficult to obtain reliable data and
data that are consistent across states and over time
because states can now change the way they provide
assistance. Under past law, both the population and the
benefits were defined by federal standards; under the new
law, however, the eligible population(s) may vary
considerably and the benefits may take many forms (such
as cash, noncash assistance, services, and employment
subsidies). This will make it more difficult to compare
benefit packages, since providing aid in forms other than
direct cash assistance raises serious valuation problems.
In addition, states
may separate federal and state funds to create new
assistance programs. One reason for such a split is that
the new law imposes requirements on programs funded with
federal dollars, but states have more flexibility with
programs financed by state funds. This may have
unintended consequences related to data analysis. For
example, states may choose to provide assistance with
state-funded programs to recipients after they reach the
federally mandated time limit. An analysis of welfare
spells would identify this as a five-year spell, when in
fact welfare receipt would have continued, just under a
different program name. Even if states submitted data on
their programs, capturing the total period of welfare
receipt would require an ability to match data from
different programs.
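A hypothetical sketch of that matching problem follows. The case identifiers, program names, and months of receipt are invented, but they show why a spell that looks complete in one program's records may in fact continue under another program.

```python
# A minimal sketch (hypothetical data): measuring total welfare receipt
# can require matching records across separately named programs.
federal_tanf_months = {"case_001": 60, "case_002": 24}   # federally funded program
state_program_months = {"case_001": 14, "case_003": 8}   # state-funded successor

all_cases = set(federal_tanf_months) | set(state_program_months)
total_receipt = {
    case: federal_tanf_months.get(case, 0) + state_program_months.get(case, 0)
    for case in all_cases
}

# Without the match, case_001 looks like a completed five-year (60-month)
# spell; with it, receipt is seen to continue under the state program.
for case, months in sorted(total_receipt.items()):
    print(case, months, "months of assistance")
```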
It will be
especially difficult to compare events before and after
the implementation of the new law, let alone across
states and localities. The Census Bureau is already
struggling with such issues. For example, until 1996, all
states had AFDC programs, but under TANF, they may
replace AFDC with one or more state programs, each with
its own name. Simply asking survey members about what
assistance they receive now requires considerable
background work in each state to identify the programs to
be included in the survey.
In assessing the
adequacy of an evaluation's data collection, questions
such as the following should be asked:
- Are the data
sources appropriate for the questions being
studied?
- Are the data
complete? What steps were taken to minimize
missing data? For example, for survey-based
findings, what procedures were used to obtain
high response rates?
- Is the sample
size sufficiently large to yield precise impact
estimates, both overall and for important
subgroups?
- Are the data
accurate? How was accuracy verified?
- What
statistical or other controls were used to
correct for potential bias resulting from missing
or erroneous data? Were these techniques
appropriate?
- What are the
implications of missing or erroneous data for the
findings?
Program
Implementation. Key to understanding the success
or failure of a program is how well it is implemented.
Accordingly, a critical issue in evaluating programs is
the degree to which they are implemented in accordance
with original plans and the nature and extent of any
deviations. Descriptive studies of program implementation
are necessary for that understanding and for assessing
the program's evaluation.
No matter how
well-designed and implemented an evaluation may be, if
the program was not implemented well, its impact findings
may be of little use for policymaking. For example, the
impact assessment of Wisconsin's "Learnfare"
found that the program had virtually no impact on school
attendance, high school graduation, and other related
outcomes.9 The implementation study found that
welfare staff experienced difficulties in obtaining the
necessary attendance data to ensure school attendance and
that penalties for noncompliance were rarely enforced.
Thus, the implementation analysis demonstrated that the
initiative was never really given a fair test and
provided important information to help state
decisionmakers fine-tune their program.
In assessing the
adequacy of an evaluation of a program's implementation,
questions such as the following should be asked:
- Is the program
or policy being evaluated fully described?
- Does the
evaluation describe how the policy changes were
implemented and operated?
- If defective,
how did poor implementation affect estimates of
effectiveness?
Measurement.
Process and outcome variables must have reliable
and valid measures. For most evaluations, the principal
variables are those measuring program participation,
services delivered, and outcomes achieved. An evaluation
of a program that attempts to move clients to employment
in the private sector clearly needs reliable and valid
measures of labor force participation. A program designed
to bolster attitudes related to the "work
ethic" needs to measure changes in such attitudes as
carefully as possible. (Adequate research procedures
include advance testing of measurement instruments to
determine their statistical properties and validity.)
Especially important
are measures of outcomes for which there is no long
history of measurement efforts. Because of the half
century of concern with measuring labor force
participation, such measures have characteristics and
statistical properties that are well known. In contrast,
social scientists have much less experience measuring
such concepts as "work ethic" attitudes, the
"well-being" of children, or household and
family structures. Many welfare reform efforts now
underway are likely to have goals that imply the use of
such measures. Whatever measures of such new concepts are
used need to be examined carefully in order to understand
their properties and validity. (The better evaluations
will report in detail about how measures were constructed
and tested for reliability and validity.)
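By way of illustration, one conventional reliability check for a multi-item scale is Cronbach's alpha. The sketch below applies it to invented responses to a hypothetical five-item "work ethic" attitude scale; it is not drawn from any of the evaluations discussed here.

```python
# A minimal sketch (hypothetical data): Cronbach's alpha as an
# internal-consistency check for a multi-item attitude scale.
import numpy as np

rng = np.random.default_rng(2)
n_respondents, n_items = 300, 5

# Simulate correlated item responses on a 1-5 agreement scale.
underlying_attitude = rng.normal(0, 1, n_respondents)
items = np.clip(
    np.round(3 + underlying_attitude[:, None]
             + rng.normal(0, 1, (n_respondents, n_items))),
    1, 5,
)


def cronbach_alpha(scores):
    """Internal-consistency reliability of a multi-item scale."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)


print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```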
In some cases, the
intervention itself may affect the measurement of an
outcome. For example, Wisconsin's "Learnfare"
program requires that AFDC teens meet strict school
attendance standards or face a reduction in their
benefits. The Learnfare mandate relies on teachers and
school systems to submit attendance data. Garfinkel and
Manski observe that the program may have changed
attendance reporting practices:
It has been
reported that, in some schools, types of absences
that previously were recorded as
"unexcused" are now being recorded as
"excused" or are not being recorded at
all. In other schools, reporting may have been
tightened. The explanation offered is that
Learnfare has altered the incentives to record
attendance accurately. Some teachers and
administrators, believing the program to be
unfairly punitive, do what they can to lessen its
effects. Others, supporting the program, act to
enhance its impact.10
In short, program
interventions (and sometimes evaluations themselves) can
change the measurement of important outcomes.
In assessing the
adequacy of an evaluation's process and outcome measures,
questions such as the following should be asked:
- Were all
appropriate and relevant variables measured?
- Were the
measurements affected by response and recall
biases? Did subjects misrepresent data for
various reasons? Were there Hawthorne effects;
that is, did the act of measurement affect the
outcome?
Analytical
Models. Data collected in evaluations need to be
summarized and analyzed by using statistical models that
are appropriate to the data and to the substantive issues
of the evaluation.
For example, if an
important substantive question is whether certain kinds
of welfare clients are most likely to obtain long-term
employment, the analytical models used must be
appropriate to the categorical nature of employment
(i.e., a person is either employed or not) and have the
ability to take into account the multivariate character
of the likely correlates of employment.
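For a yes/no outcome such as employment, a logistic regression is one model with an appropriate functional form that also accommodates multiple correlates. The sketch below fits such a model to invented data with hypothetical client characteristics; it illustrates the kind of model intended, not any particular evaluation's analysis.

```python
# A minimal sketch (hypothetical data): a logistic regression treats
# employment as a yes/no outcome and lets several client characteristics
# enter the model at once.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

# Hypothetical client characteristics.
years_schooling = rng.normal(11, 2, n)
prior_work = rng.integers(0, 2, n)        # any work in the prior year
in_program = rng.integers(0, 2, n)        # exposed to the reform

# Hypothetical employment outcome generated on the probability scale.
index = -6 + 0.4 * years_schooling + 0.8 * prior_work + 0.5 * in_program
employed = (rng.random(n) < 1 / (1 + np.exp(-index))).astype(int)

X = np.column_stack([years_schooling, prior_work, in_program])
model = LogisticRegression().fit(X, employed)

# Estimated association between program exposure and the log-odds of
# employment, holding the other characteristics constant.
print("Program coefficient (log-odds):", model.coef_[0][2].round(2))
```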
Critical
characteristics of good analytic models include adequate
specification (the variables included are substantively
relevant) and proper functional form (the model is
appropriate to the statistical properties of the data
being analyzed). This is particularly important for
quasi-experimental and nonexperimental evaluations.
Developing
appropriate analytical models for quasi-experiments has
been the subject of much debate. LaLonde11 and Fraker and Maynard12 compared the findings from
an experimental evaluation of the National Supported Work
(NSW) demonstration to those derived using comparison
groups drawn from large national surveys that used
statistical models purporting to correct for selection
biases. The estimated impacts varied widely in the
quasi-experimental models and, most importantly, differed
from the experimentally derived estimates. LaLonde found
that "even when the econometric tests pass
conventional specification tests, they still fail to
replicate the experimentally determined results."13
Not all researchers
share these concerns. Heckman and Smith criticize the
earlier studies of LaLonde14 and Fraker and Maynard15 by arguing that the problem
was not with nonexperimental methods per se, but with the
use of incorrect models in the analyses.16 They also claim that the
earlier studies did not "utilize a variety of
model-selection strategies based on standard
specification tests."17 They add that earlier work
by Heckman and Hotz,18 using the NSW data,
"successfully eliminates all but the nonexperimental
models that reproduce the inference obtained by
experimental methods."19 Thus, they conclude that
specification tests can be a powerful tool in analyzing
data from quasi-experimental designs. (The complexity of
the statistical issues that arise in some evaluations is
clearly beyond the scope of most policymakers.)
In assessing the
adequacy of an evaluation's analytical models, questions
such as the following should be asked:
- Were
appropriate statistical models used?
- Were the models
used tested for specification errors?
Interpretation
of Findings. No matter how carefully they are analyzed,
numbers do not speak for themselves, nor do
they speak directly to policy issues. An adequate
evaluation is one in which the findings are interpreted
in an even-handed manner, with justifiable statements
about the substantive meaning of the findings. The
evaluation report should disclose the limitations of the
data analyses and present alternate interpretations.
The data resulting
from an evaluation often can be analyzed in several ways,
each of which may lead to somewhat different
interpretations. An example of how alternative analysis
modes can affect interpretations is found in an MDRC
report on California's Greater Avenues for Independence
(GAIN) program.20 GAIN is a statewide employment and
training program for AFDC recipients, evaluated by MDRC
in six counties, ranging from large urban areas, such as
Los Angeles and San Diego, to relatively small counties,
such as Butte and Tulare. The report presented impact
findings for all six counties separately, as well as
together, for three years of program operation.
In presenting the
aggregate impacts, MDRC gave each county equal weight. As
a result, Butte, which represented less than 1 percent of
the state's AFDC caseload, had the same weight as Los
Angeles, which had almost 34 percent of the state's
caseload. Using this approach, MDRC reported that GAIN
increased earnings by $1,414 and reduced AFDC payments by
$961 over a three-year follow-up period. This approach gives
smaller counties a disproportionate weight in the
calculation of aggregate statewide impacts, but MDRC
chose it because "it is simple and does not
emphasize the strong or weak results of any one
county."21 MDRC also examined other weighting
options. For example, it weighted the impacts according
to each county's GAIN caseload. This resulted in an
earnings increase of $1,333 and an AFDC payment reduction
of $1,087. Although the impact estimates are somewhat
similar to those using the first weighting method, the
differences are not trivial.
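The arithmetic behind such differences can be shown with a small sketch. The county names below are the six GAIN research counties, but the per-county impacts and caseload shares are invented solely to show how equal and caseload weighting can produce different aggregate estimates.

```python
# A minimal sketch (hypothetical numbers): how the choice of weights can
# shift an aggregate impact estimate across evaluation sites.
county_impacts = {          # hypothetical three-year earnings impacts ($)
    "Alameda": 900, "Butte": 2000, "Los Angeles": 700,
    "Riverside": 2800, "San Diego": 1500, "Tulare": 400,
}
caseload_shares = {         # hypothetical shares of the statewide caseload
    "Alameda": 0.05, "Butte": 0.01, "Los Angeles": 0.34,
    "Riverside": 0.10, "San Diego": 0.13, "Tulare": 0.04,
}

# Equal weighting: every county counts the same, however small.
equal_weighted = sum(county_impacts.values()) / len(county_impacts)

# Caseload weighting: each county counts in proportion to its share.
total_share = sum(caseload_shares.values())
caseload_weighted = sum(
    county_impacts[c] * caseload_shares[c] / total_share for c in county_impacts
)

print(f"Equal-weight aggregate impact:    ${equal_weighted:,.0f}")
print(f"Caseload-weight aggregate impact: ${caseload_weighted:,.0f}")
```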
The impacts could
also have been weighted based on each county's AFDC
caseload, but this option was not discussed. Although Los
Angeles county comprised 33.7 percent of the state's AFDC
caseload, its share of the GAIN caseload was just 9.7
percent. In contrast, San Diego county represented just
7.4 percent of the AFDC caseload, but 13.3 percent of the
GAIN caseload.22 As a result, these counties would
have very different effects on the aggregate impact
estimates, depending on which weighting mechanism is
used. Clearly, the interpretation of research findings
can be influenced by the ways in which the findings from
sites are combined to form overall estimates of
effectiveness.
In assessing the
adequacy of an evaluation's interpretation of findings,
questions such as the following should be asked:
- When
alternative analysis strategies are possible, did
the evaluation show how sensitive findings are to
the use of such alternatives?
- Are alternative
interpretations of the data discussed?
- Are important
caveats regarding the findings stated?
© 1997 by the University of Maryland,
College Park, Maryland. All rights reserved. No part
of this publication may be used or reproduced in any manner
whatsoever without permission in writing from the University of
Maryland except in cases of brief quotations embodied in news
articles, critical articles, or reviews. The views
expressed in the publications of the University of Maryland are
those of the authors and do not necessarily reflect the views of
the staff, advisory panels, officers, or trustees of the
University of Maryland.