Evaluating the Evaluations
As the foregoing
discussion of studies suggests, the next few years will
witness a veritable flood of new evaluation reports. The
total body of research will be large, complex, and likely
to lead to diverse and contradictory findings.
The Need
Many of the
evaluations will provide important information about the
impact of the new welfare regime on individuals and
institutions. They will identify the difficulties and
successes that states have had in implementing their
reforms, and estimate the impacts of such reforms on the
well-being of the poor, especially on their children.
These findings, in turn, can help policymakers choose
between various program approaches. For example, after
MDRC documented the apparent success of "labor force
attachment strategies" in reducing welfare
caseloads, many states adopted them.
However, many of the
evaluations will have such serious flaws that their
utility will be sharply limited. For example, because of
design and implementation problems, no one may ever know
whether New Jersey's "family cap" had any
impact on the birth rates of mothers on welfare.
(Recently, two outside experts reviewed the evaluation of
New Jersey's Family Development Program, which included a
family cap provision. They concluded that there were
serious methodological flaws in the evaluation, so an
interim report was not released.)
Evaluations can go
wrong in many ways. Some have such obvious faults that
almost anyone can detect them. Other flaws can be
detected only by experts with long experience and high
levels of judgment.
The "100-hour
rule" demonstrations are an example of the need for
the expert review of evaluations. The AFDC-UP Program
(abolished by TANF) provided benefits to two-parent
families if the principal earner had a significant work
history and worked less than 100 hours per month. Because
this latter requirement, the so-called "100-hour
rule," was thought to create a disincentive for
full-time employment, the FSA authorized a set of
experiments to alter the rule. Three states (California,
Utah, and Wisconsin) initiated demonstrations to evaluate
the impact of softening the rule.
Findings from these
evaluations suggest that eliminating the rule for current
recipients had little impact on welfare receipt,
employment, or earnings. But in a recent review, Birnbaum
and Wiseman identified many flaws in these studies.1 First, random assignment
procedures were undermined in all three states, so the
treatment and control groups were not truly comparable.
Second, the states did a poor job of explaining the
policy change to the treatment group, limiting its impact
on client behavior. Third, some outcomes, such as those
related to family structure, were poorly measured.
The proper use of
these forthcoming evaluations requires the ability to
distinguish relevant and valid findings from those that
are not. This does not mean that studies must be perfect
in order to be useful. Research projects entirely without
flaws do not exist and, arguably, never will.
Almost every
evaluation is compromised by programmatic, funding, time,
or political constraints. No program has been implemented
with absolute fidelity to the original design. No
sampling plan has ever been without faults. Some
observations and data are missing from every data set.
Analytical procedures are always misspecified to some
degree. In other words, evaluation findings are only more
or less credible, and even poorly designed and
executed evaluations can contain some information worth
noting.
Devolution has
further increased the need for careful, outside reviews
of research findings. Previously, the federal government
required a rigorous evaluation in exchange for granting
state waivers, and federal oversight of the evaluations
provided some quality control. In keeping with the new
welfare law's block-grant approach, the federal
government's supervision of the evaluations of
state-based welfare initiatives will be curtailed: States
are no longer required to evaluate their reforms and, if
they do, they can choose any methodology they wish.
Already, there are
indications that state discretion under TANF will lead to
a proliferation of evaluation designs, some rigorous but
many not. As Galster observes, "Many state agencies
either lack policy evaluation and research divisions
altogether, or use standards for program evaluation that
are not comparable to those set by their federal
counterparts. The quantity and quality of many
state-initiated evaluations of state-sponsored programs
may thus prove problematic."2
The number of
studies purporting to evaluate welfare reform will grow
rapidly in the years to come. The challenge facing
policymakers and practitioners will be to sort through
the many studies and identify those that are credible. It
is a task that will be complicated by the volume and
complexity of the studies, and the highly charged
political atmosphere that surrounds them.
Tension is already
building between the conservative lawmakers responsible
for crafting the welfare bill and the predominantly
liberal scholars involved in monitoring and evaluating
it. Many of the researchers now studying the effects of
the welfare law were also vocal critics of it. For
example, the Urban Institute's $50 million project to
assess the "unfolding decentralization of social
programs" is being conducted by the same
organization whose researchers, in a highly controversial
study, claimed that the new law would push 2.6 million
people, including 1.1 million children, into poverty.3
This has caused some
conservatives in Congress to worry that
"pseudo-academic research" will unfairly
portray the effects of the welfare overhaul.4 Undoubtedly, some on the
left as well as the right will misuse or oversimplify
research findings to their own advantage, but even the
perception of bias can limit the policy relevance of
research. Good research should be identified, regardless
of the ideological stripes of its authors.
Review Criteria
The key issue is the
extent to which a discerned fault reduces the credibility
of a study. Unfortunately, most policymakers and
practitioners are ill-equipped to judge which faults are
fatal, especially since they often must act before the
traditional scholarly process can filter out invalid
results. This is understandable, since assessing
evaluation studies often requires both detailed knowledge
of the programs involved and a high level of technical
expertise.
To help them better
assess this research and glean the lessons it offers,
this paper also describes and explains the generally
accepted criteria for judging evaluations. The criteria,
of course, are not equally applicable to all evaluations.
Program
"Theory". Underlying every program's
design is some theory or model of how the program is
conceived to work and how it matches the condition it is
intended to ameliorate. An evaluation of the program
should describe the underlying social problem it is
intended to address and how the causal processes
described in the model are expected to achieve program
goals. Hence, a critical issue in assessing evaluations
is the adequacy of program models.
Special problems are
presented by reforms that have several goals. Many of the
waiver-based experiments are intended to achieve diverse
objectives, such as increasing work effort and promoting
stable families, and, thus, involve multiple
interventions. Sometimes the processes can work at cross
purposes, placing conflicting incentives on clients. For
example, many states have simultaneously expanded
earnings disregards and imposed strict time limits. As a
result, families that go to work may be able to retain a
modest cash grant as a result of the liberalized
treatment of earnings, but if they want to conserve their
time-limited benefits, they may choose not to take
advantage of this incentive. Examination of program
theory can reveal such conflicts and identify potential
unwanted side effects.
In assessing the
adequacy of an evaluation's program theory, questions
such as the following should be raised:
- Is there an
adequate description of the underlying social
problem the intervention is meant to address?
- Does the
intervention make sense in light of existing
social science theory and previous evaluations of
similar interventions?
- Are the
hypothesized causal processes by which the reform
effort is intended to achieve its goals clearly
stated?
- Have potential
unwanted side effects been identified?
Research
Design. An evaluation's research design is
crucial to its ability to answer, in credible ways,
substantive questions about program effectiveness. There
are two central issues in research design: (1) "internal
validity," or the ability to rule out
alternative interpretations of research findings; and (2)
"external validity," or the ability to
support generalizations from findings to larger
populations of interest.
For example, an
evaluation that is based solely on measures of client
employment levels taken before and after a reform is
instituted lacks strong internal validity because any
observed changes in employment levels cannot be uniquely
attributed to the reform measures. Similarly, an
implementation study of one welfare office in a state
system with scores of such offices is of limited external
validity because the office studied may not fairly
represent all the others.
The effectiveness of
a program is measured by comparing what happens when a
program is in place to what happens without the program,
the "counterfactual." A critical issue is how
the evaluation is designed to estimate this difference.
In this respect,
randomized experimental designs are considered to be
superior to other designs. (Experimental and
quasi-experimental designs are discussed in Appendix A.)
In a randomized experiment, individuals or families (or
other units of analysis) are randomly assigned to either
a treatment group to whom the program is given or a
control group from whom the program is withheld. If
properly conducted, random assignment should result in
two groups that, initially, are statistically comparable
to one another. Thus, any differences in outcomes between
the groups can be attributed to the effects of the
intervention with a known degree of statistical
precision. Random assignment rules out other possible
influences, except for the intervention itself, and
therefore has strong internal validity.
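To make that logic concrete, the sketch below (not drawn from any of the evaluations discussed here) simulates random assignment with invented earnings data and computes the impact estimate as the treatment-control difference in means, together with a conventional standard error. All numbers and variable names are hypothetical.

```python
# A minimal sketch (hypothetical data): random assignment and an
# experimental impact estimate computed as a difference in means.
import random
import statistics


def simulate_experiment(n=2000, true_impact=500.0, seed=42):
    """Randomly assign n families to treatment or control and return
    simulated annual earnings for each group (invented numbers)."""
    rng = random.Random(seed)
    treatment, control = [], []
    for _ in range(n):
        baseline = rng.gauss(mu=6000, sigma=2500)  # earnings without the program
        if rng.random() < 0.5:                     # random assignment
            treatment.append(baseline + true_impact)
        else:
            control.append(baseline)
    return treatment, control


def impact_estimate(treatment, control):
    """Treatment-control difference in means with a conventional standard error."""
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = (statistics.variance(treatment) / len(treatment)
          + statistics.variance(control) / len(control)) ** 0.5
    return diff, se


treatment, control = simulate_experiment()
diff, se = impact_estimate(treatment, control)
print(f"Estimated impact: ${diff:,.0f} (standard error ${se:,.0f})")
```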
Although random
assignment is usually the most desirable design, it is
not always feasible, especially when a program enrolls
all or most of its clientele. Quasi-experimental designs
are then employed. They rely on identifying a comparison
group with characteristics similar to those of the
treatment group, but from another geographic area or time
period or otherwise unexposed to the new policy. In some
cases, the outcomes of those subject to a new welfare
policy may be compared before and after exposure to the
new policy.
The major difficulty
with quasi-experimental designs is that the members of
comparison groups may differ in some unmeasured or
undetectable ways from those who have been exposed to the
particular program or intervention. Typically,
quasi-experimental designs employ statistical analyses to
control for such differences, but how well this is done
is open to debate. As a result, their internal validity
is not as strong as with randomized experiments. Judging
the strength of an evaluation design's internal validity
should be an issue at the center of any assessment.
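As a rough illustration of the statistical adjustment described above, the hypothetical sketch below regresses an invented outcome on a program indicator and one measured background characteristic; the program coefficient is the adjusted impact estimate. The data, variable names, and effect sizes are assumptions made for illustration, and any unmeasured differences between the groups would remain uncontrolled.

```python
# A minimal sketch (hypothetical data): regression adjustment for a
# quasi-experimental comparison. The program coefficient is an impact
# estimate "controlling for" the measured characteristic only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical background characteristic (e.g., prior-year earnings).
prior_earnings = rng.normal(5000, 2000, n)

# Comparison-group members come from a different area, so program exposure
# is related to prior earnings (the selection problem).
in_program = (prior_earnings + rng.normal(0, 2000, n) > 5000).astype(float)

# Outcome depends on prior earnings and on a true program impact of $800.
outcome = 3000 + 0.8 * prior_earnings + 800 * in_program + rng.normal(0, 1500, n)

# Naive contrast ignores the selection on prior earnings.
naive = outcome[in_program == 1].mean() - outcome[in_program == 0].mean()

# Regression adjustment: outcome ~ intercept + program + prior earnings.
X = np.column_stack([np.ones(n), in_program, prior_earnings])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"Naive difference in means:    ${naive:,.0f}")
print(f"Regression-adjusted estimate: ${coef[1]:,.0f}")
```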
External validity is
also crucial for policy purposes. Even an extremely
well-designed evaluation with high internal validity is
not useful to policymakers if its findings cannot be
extrapolated to the program's total clientele.
In large part, an
evaluation's external validity depends on how the
research population is selected. In many of the
waiver-based welfare reform demonstrations, the research
sites either volunteered to participate or were selected
based on criteria, such as caseload size and
administrative capacity, which did not make their
caseloads representative of the state's welfare
population as a whole. For example, in Florida, sites
were encouraged to volunteer in the Family Transition
Program. The two sites eventually selected were chosen
because they had extensive community involvement and
resources that could be committed to the project.5 In addition, random
assignment was phased in so as not to overload the
capacity of the new program to provide the promised
services. Thus, the findings are unlikely to be
representative of what would happen elsewhere in the
state (much less the nation), especially if implemented
on a large scale.
The evaluations of
the new welfare law will employ a variety of research
methods, including randomized experiments,
quasi-experimental and nonexperimental designs,
ethnographic studies, and implementation research. Each
has its own strengths and weaknesses. The method used
should be linked to the particular questions asked, the
shape of the program, and the available resources.
In assessing the
adequacy of an evaluation's research design, questions
such as the following should be asked:
- Are the impact
estimates unbiased (internal validity)? How was
bias (or potential bias) monitored and controlled
for? Were these techniques appropriate?
- Are the
findings generalizable to larger populations
(external validity)? If not, how does this limit
the usefulness of the findings?
Data
Collection. Allen once observed that
"adequate data collection can be the Achilles heel
of social experimentation."6 Indeed, many evaluations are
launched without ensuring that adequate data collection
and processing procedures are in place. According to
Fein, "Typical problems include delays in receiving
data, receiving data for the wrong sample or in the wrong
format, insufficient documentation of data structure and
contents, difficulties in identifying demonstration
participants, inconsistencies across databases, and
problems created when states convert from old to new
eligibility systems."7 Careful data collection is
essential for evaluation findings to be credible.
The data used to
evaluate the new welfare law will come from
administrative records and specially designed sample
surveys. In addition, some evaluations may involve the
administration of standardized tests, qualitative or
ethnographic observations, and other information
gathering approaches. Each of these has its own strengths
and limitations.
Because
administrative data are already collected for program
purposes, they are relatively inexpensive to use for
research purposes. For some variables, administrative
data may be more accurate than survey data, because they
are not subject to nonresponse and recall problems, as
surveys are.
Some administrative
data, however, may be inaccurate, particularly those that
are unnecessary for determining program eligibility or
benefit amounts. In addition, they may not be available
for some outcomes or may cover only part of the
population being studied. For example, information
related to family structure would only be available for
the subset of cases that are actually receiving
assistance.
The primary
advantage of surveys is that they enable researchers to
collect the data that are best suited for the analysis.
However, nonresponse and the inability (or unwillingness)
of respondents to report some outcomes accurately can
result in missing or inaccurate data. Moreover, surveys
can be expensive. Thus, many evaluations use several
different data sources.
Unfortunately,
evaluation designs are sometimes selected before
determining whether the requisite data are available. For
example, New Jersey's Realizing Economic Achievement
(REACH) program was evaluated by comparing participants'
outcomes with those of a cohort of similar individuals
from an earlier period, using state-level data. The evaluator concluded that
"shortcomings in the basic evaluation design . . .
and severe limitations in the scope and quality of the
data available for analysis, make it impossible to draw
any policy-relevant conclusions from the results."8
Although very few
social research efforts have achieved complete coverage
of all the subjects from which data are desired,
well-conducted research can achieve acceptably high
response rates. Several welfare reform demonstrations
have been plagued by low response rates, some as low as
30 percent. A high nonresponse rate to a survey or to
administrative data collection efforts can limit severely
the internal and external validity of the findings. Even
when response rates are high, all data collection efforts
end up with some missing or erroneous data; adequate data
collection minimizes missing observations and missing
information on observations made.
The new welfare law
significantly complicates data collection and analysis.
It will be more difficult to obtain reliable data and
data that are consistent across states and over time
because states can now change the way they provide
assistance. Under past law, both the population and the
benefits were defined by federal standards; under the new
law, however, the eligible population(s) may vary
considerably and the benefits may take many forms (such
as cash, noncash assistance, services, and employment
subsidies). This will make it more difficult to compare
benefit packages, since providing aid in forms other than
direct cash assistance raises serious valuation problems.
In addition, states
may separate federal and state funds to create new
assistance programs. One reason for such a split is that
the new law imposes requirements on programs funded with
federal dollars, but states have more flexibility with
programs financed by state funds. This may have
unintended consequences related to data analysis. For
example, states may choose to provide assistance with
state-funded programs to recipients after they reach the
federally mandated time limit. An analysis of welfare
spells would identify this as a five-year spell, when in
fact welfare receipt would have continued, just under a
different program name. Even if states submitted data on
their programs, capturing the total period of welfare
receipt would require an ability to match data from
different programs.
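A hypothetical sketch of that matching problem follows. The case identifiers, program names, and months of receipt are invented, but they show why a spell that looks complete in one program's records may in fact continue under another program.

```python
# A minimal sketch (hypothetical data): measuring total welfare receipt
# can require matching records across separately named programs.
federal_tanf_months = {"case_001": 60, "case_002": 24}   # federally funded program
state_program_months = {"case_001": 14, "case_003": 8}   # state-funded successor

all_cases = set(federal_tanf_months) | set(state_program_months)
total_receipt = {
    case: federal_tanf_months.get(case, 0) + state_program_months.get(case, 0)
    for case in all_cases
}

# Without the match, case_001 looks like a completed five-year (60-month)
# spell; with it, receipt is seen to continue under the state program.
for case, months in sorted(total_receipt.items()):
    print(case, months, "months of assistance")
```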
It will be
especially difficult to compare events before and after
the implementation of the new law, let alone across
states and localities. The Census Bureau is already
struggling with such issues. For example, until 1996, all
states had AFDC programs, but under TANF, they may
replace AFDC with one or more state programs, each with
its own name. Simply asking survey members about what
assistance they receive now requires considerable
background work in each state to identify the programs to
be included in the survey.
In assessing the
adequacy of an evaluation's data collection, questions
such as the following should be asked:
- Are the data
sources appropriate for the questions being
studied?
- Are the data
complete? What steps were taken to minimize
missing data? For example, for survey-based
findings, what procedures were used to obtain
high response rates?
- Is the sample
size sufficiently large to yield precise impact
estimates, both overall and for important
subgroups?
- Are the data
accurate? How was accuracy verified?
- What
statistical or other controls were used to
correct for potential bias resulting from missing
or erroneous data? Were these techniques
appropriate?
- What are the
implications of missing or erroneous data for the
findings?
Program
Implementation. Key to understanding the success
or failure of a program is how well it is implemented.
Accordingly, a critical issue in evaluating programs is
the degree to which they are implemented in accordance
with original plans and the nature and extent of any
deviations. Descriptive studies of program implementation
are necessary for that understanding and for assessing
the program's evaluation.
No matter how
well-designed and implemented an evaluation may be, if
the program was not implemented well, its impact findings
may be of little use for policymaking. For example, the
impact assessment of Wisconsin's "Learnfare"
found that the program had virtually no impact on school
attendance, high school graduation, and other related
outcomes.9 The implementation study found that
welfare staff experienced difficulties in obtaining the
necessary attendance data to ensure school attendance and
that penalties for noncompliance were rarely enforced.
Thus, the implementation analysis demonstrated that the
initiative was never really given a fair test and
provided important information to help state
decisionmakers fine-tune their program.
In assessing the
adequacy of an evaluation of a program's implementation,
questions such as the following should be asked:
- Is the program
or policy being evaluated fully described?
- Does the
evaluation describe how the policy changes were
implemented and operated?
- If defective,
how did poor implementation affect estimates of
effectiveness?
Measurement.
Process and outcome variables must have reliable
and valid measures. For most evaluations, the principal
variables are those measuring program participation,
services delivered, and outcomes achieved. An evaluation
of a program that attempts to move clients to employment
in the private sector clearly needs reliable and valid
measures of labor force participation. A program designed
to bolster attitudes related to the "work
ethic" needs to measure changes in such attitudes as
carefully as possible. (Adequate research procedures
include advance testing of measurement instruments to
determine their statistical properties and validity.)
Especially important
are measures of outcomes for which there is no long
history of measurement efforts. Because of the half
century of concern with measuring labor force
participation, such measures have characteristics and
statistical properties that are well known. In contrast,
social scientists have much less experience measuring
such concepts as "work ethic" attitudes, the
"well-being" of children, or household and
family structures. Many welfare reform efforts now
underway are likely to have goals that imply the use of
such measures. Whatever measures of such new concepts are
used need to be examined carefully in order to understand
their properties and validity. (The better evaluations
will report in detail about how measures were constructed
and tested for reliability and validity.)
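By way of illustration, one conventional reliability check for a multi-item scale is Cronbach's alpha. The sketch below applies it to invented responses to a hypothetical five-item "work ethic" attitude scale; it is not drawn from any of the evaluations discussed here.

```python
# A minimal sketch (hypothetical data): Cronbach's alpha as an
# internal-consistency check for a multi-item attitude scale.
import numpy as np

rng = np.random.default_rng(2)
n_respondents, n_items = 300, 5

# Simulate correlated item responses on a 1-5 agreement scale.
underlying_attitude = rng.normal(0, 1, n_respondents)
items = np.clip(
    np.round(3 + underlying_attitude[:, None]
             + rng.normal(0, 1, (n_respondents, n_items))),
    1, 5,
)


def cronbach_alpha(scores):
    """Internal-consistency reliability of a multi-item scale."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)


print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```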
In some cases, the
intervention itself may affect the measurement of an
outcome. For example, Wisconsin's "Learnfare"
program requires that AFDC teens meet strict school
attendance standards or face a reduction in their
benefits. The Learnfare mandate relies on teachers and
school systems to submit attendance data. Garfinkel and
Manski observe that the program may have changed
attendance reporting practices:
It has been
reported that, in some schools, types of absences
that previously were recorded as
"unexcused" are now being recorded as
"excused" or are not being recorded at
all. In other schools, reporting may have been
tightened. The explanation offered is that
Learnfare has altered the incentives to record
attendance accurately. Some teachers and
administrators, believing the program to be
unfairly punitive, do what they can to lessen its
effects. Others, supporting the program, act to
enhance its impact.10
In short, program
interventions (and sometimes evaluations themselves) can
change the measurement of important outcomes.
In assessing the
adequacy of an evaluation's process and outcome measures,
questions such as the following should be asked:
- Were all
appropriate and relevant variables measured?
- Were the
measurements affected by response and recall
biases? Did subjects misrepresent data for
various reasons? Were there Hawthorne effects;
that is, did the act of measurement affect the
outcome?
Analytical
Models. Data collected in evaluations need to be
summarized and analyzed by using statistical models that
are appropriate to the data and to the substantive issues
of the evaluation.
For example, if an
important substantive question is whether certain kinds
of welfare clients are most likely to obtain long-term
employment, the analytical models used must be
appropriate to the categorical nature of employment
(i.e., a person is either employed or not) and have the
ability to take into account the multivariate character
of the likely correlates of employment.
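For a yes/no outcome such as employment, a logistic regression is one model with an appropriate functional form that also accommodates multiple correlates. The sketch below fits such a model to invented data with hypothetical client characteristics; it illustrates the kind of model intended, not any particular evaluation's analysis.

```python
# A minimal sketch (hypothetical data): a logistic regression treats
# employment as a yes/no outcome and lets several client characteristics
# enter the model at once.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

# Hypothetical client characteristics.
years_schooling = rng.normal(11, 2, n)
prior_work = rng.integers(0, 2, n)        # any work in the prior year
in_program = rng.integers(0, 2, n)        # exposed to the reform

# Hypothetical employment outcome generated on the probability scale.
index = -6 + 0.4 * years_schooling + 0.8 * prior_work + 0.5 * in_program
employed = (rng.random(n) < 1 / (1 + np.exp(-index))).astype(int)

X = np.column_stack([years_schooling, prior_work, in_program])
model = LogisticRegression().fit(X, employed)

# Estimated association between program exposure and the log-odds of
# employment, holding the other characteristics constant.
print("Program coefficient (log-odds):", model.coef_[0][2].round(2))
```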
Critical
characteristics of good analytic models include adequate
specification (the variables included are substantively
relevant) and proper functional form (the model is
appropriate to the statistical properties of the data
being analyzed). This is particularly important for
quasi-experimental and nonexperimental evaluations.
Developing
appropriate analytical models for quasi-experiments has
been the subject of much debate. LaLonde11 and Fraker and Maynard12 compared the findings from
an experimental evaluation of the National Supported Work
(NSW) demonstration to those derived using comparison
groups drawn from large national surveys that used
statistical models purporting to correct for selection
biases. The estimated impacts varied widely in the
quasi-experimental models and, most importantly, differed
from the experimentally derived estimates. LaLonde found
that "even when the econometric tests pass
conventional specification tests, they still fail to
replicate the experimentally determined results."13
Not all researchers
share these concerns. Heckman and Smith criticize the
earlier studies of LaLonde14 and Fraker and Maynard15 by arguing that the problem
was not with nonexperimental methods per se, but with the
use of incorrect models in the analyses.16 They also claim that the
earlier studies did not "utilize a variety of
model-selection strategies based on standard
specification tests."17 They add that earlier work
by Heckman and Hotz,18 using the NSW data,
"successfully eliminates all but the nonexperimental
models that reproduce the inference obtained by
experimental methods."19 Thus, they conclude that
specification tests can be a powerful tool in analyzing
data from quasi-experimental designs. (The complexity of
the statistical issues that arise in some evaluations is
clearly beyond the scope of most policymakers.)
In assessing the
adequacy of an evaluation's analytical models, questions
such as the following should be asked:
- Were
appropriate statistical models used?
- Were the models
used tested for specification errors?
Interpretation
of Findings. No matter how carefully they are analyzed,
numbers do not speak for themselves, nor do
they speak directly to policy issues. An adequate
evaluation is one in which the findings are interpreted
in an even-handed manner, with justifiable statements
about the substantive meaning of the findings. The
evaluation report should disclose the limitations of the
data analyses and present alternate interpretations.
The data resulting
from an evaluation often can be analyzed in several ways,
each of which may lead to somewhat different
interpretations. An example of how alternative analysis
modes can affect interpretations is found in an MDRC
report on California's Greater Avenues for Independence
(GAIN) program.20 GAIN is a statewide employment and
training program for AFDC recipients, evaluated by MDRC
in six counties, ranging from large urban areas, such as
Los Angeles and San Diego, to relatively small counties,
such as Butte and Tulare. The report presented impact
findings for all six counties separately, as well as
together, for three years of program operation.
In presenting the
aggregate impacts, MDRC gave each county equal weight. As
a result, Butte, which represented less than 1 percent of
the state's AFDC caseload, had the same weight as Los
Angeles, which had almost 34 percent of the state's
caseload. Using this approach, MDRC reported that GAIN
increased earnings by $1,414 and reduced AFDC payments by
$961 over a three-year follow-up period. This approach gives
smaller counties a disproportionate weight in the
calculation of aggregate statewide impacts, but MDRC
chose it because "it is simple and does not
emphasize the strong or weak results of any one
county."21 MDRC also examined other weighting
options. For example, it weighted the impacts according
to each county's GAIN caseload. This resulted in an
earnings increase of $1,333 and an AFDC payment reduction
of $1,087. Although the impact estimates are somewhat
similar to those using the first weighting method, the
differences are not trivial.
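The arithmetic behind such differences can be shown with a small sketch. The county names below are the six GAIN research counties, but the per-county impacts and caseload shares are invented solely to show how equal and caseload weighting can produce different aggregate estimates.

```python
# A minimal sketch (hypothetical numbers): how the choice of weights can
# shift an aggregate impact estimate across evaluation sites.
county_impacts = {          # hypothetical three-year earnings impacts ($)
    "Alameda": 900, "Butte": 2000, "Los Angeles": 700,
    "Riverside": 2800, "San Diego": 1500, "Tulare": 400,
}
caseload_shares = {         # hypothetical shares of the statewide caseload
    "Alameda": 0.05, "Butte": 0.01, "Los Angeles": 0.34,
    "Riverside": 0.10, "San Diego": 0.13, "Tulare": 0.04,
}

# Equal weighting: every county counts the same, however small.
equal_weighted = sum(county_impacts.values()) / len(county_impacts)

# Caseload weighting: each county counts in proportion to its share.
total_share = sum(caseload_shares.values())
caseload_weighted = sum(
    county_impacts[c] * caseload_shares[c] / total_share for c in county_impacts
)

print(f"Equal-weight aggregate impact:    ${equal_weighted:,.0f}")
print(f"Caseload-weight aggregate impact: ${caseload_weighted:,.0f}")
```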
The impacts could
also have been weighted based on each county's AFDC
caseload, but this option was not discussed. Although Los
Angeles county comprised 33.7 percent of the state's AFDC
caseload, its share of the GAIN caseload was just 9.7
percent. In contrast, San Diego county represented just
7.4 percent of the AFDC caseload, but 13.3 percent of the
GAIN caseload.22 As a result, these counties would
have very different effects on the aggregate impact
estimates, depending on which weighting mechanism is
used. Clearly, the interpretation of research findings
can be influenced by the ways in which the findings from
sites are combined to form overall estimates of
effectiveness.
In assessing the
adequacy of an evaluation's interpretation of findings,
questions such as the following should be asked:
- When
alternative analysis strategies are possible, did
the evaluation show how sensitive findings are to
the use of such alternatives?
- Are alternative
interpretations of the data discussed?
- Are important
caveats regarding the findings stated?
© 1997 by the University of Maryland,
College Park, Maryland. All rights reserved. No part
of this publication may be used or reproduced in any manner
whatsoever without permission in writing from the University of
Maryland except in cases of brief quotations embodied in news
articles, critical articles, or reviews. The views
expressed in the publications of the University of Maryland are
those of the authors and do not necessarily reflect the views of
the staff, advisory panels, officers, or trustees of the
University of Maryland.