Appendix A
Experimental vs.
Quasi-Experimental Designs
Many social welfare
programs look successful--to their own staffs as well as
to outsiders--because their clients seem to be doing so
well. For example, a substantial proportion of trainees
may have found jobs after having gone through a
particular program. The question is: Did they get their
jobs because of the program, or would they have done so
anyway? Answering this question is the central task in
evaluating the impact of a program or policy. In other
words, what would have happened to the clients if they
had not been in the program or subject to the policy?
The key task of an
impact evaluation is to isolate and measure the program
or policy's effects independent of other factors that
might be at work, such as local economic conditions, the
characteristics of participants, and the quality of the
particular project's leadership. To do so, researchers
try to establish the "counterfactual"; that is,
they try to see what happened to a similar group that was
not subject to the program or policy.
Researchers use
either experimental or quasi-experimental designs to
establish the counterfactual. After describing both
approaches, this appendix summarizes their principal
strengths and limitations, with illustrations from recent
studies.
Experimental
Designs
Many social
scientists believe that experimental designs are the best
way to measure a program or policy's impact. In an
experimental design, individuals, families, or other
units of analysis are randomly assigned to either a
treatment or control group. The treatment group is
subjected to the new program or policy, and the control
group is not. The experience of the control group, thus,
is meant to represent what would have happened but for
the intervention.
If properly planned
and implemented, an experimental design should result in
treatment and control groups that have comparable
measurable and unmeasurable aggregate characteristics
(within the limits of chance variation). And, from the
moment of randomization, they will be exposed to the same
outside forces, such as economic conditions, social
environments, and other events--allowing any
subsequent differences in average outcomes to be
attributed to the intervention.
Thus, experimental
designs ordinarily do not require complex statistical
adjustments to eliminate differences between treatment
and control groups. Policymakers can then focus on the
implications of findings, rather than "become
entangled in a protracted and often inconclusive
scientific debate about whether the findings of a
particular study are statistically valid."1 As we will see, the same is
not true for quasi-experiments.
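The arithmetic of an experimental impact estimate is correspondingly simple. The sketch below is purely illustrative (hypothetical data and function names, not drawn from any study cited here): applicants are randomly split into treatment and control groups, and the impact is estimated as the difference in average outcomes.

```python
import random
import statistics

def random_assignment(applicant_ids, seed=1):
    """Randomly split a pool of applicants into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(applicant_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]   # treatment, control

def estimated_impact(outcomes, treatment_ids, control_ids):
    """Difference in mean outcomes; with random assignment, no further
    statistical adjustment is ordinarily needed (up to chance variation)."""
    treatment_mean = statistics.mean(outcomes[i] for i in treatment_ids)
    control_mean = statistics.mean(outcomes[i] for i in control_ids)
    return treatment_mean - control_mean

# Made-up example: outcome is 1 if the person found a job within a year.
applicants = list(range(8))
treatment, control = random_assignment(applicants)
outcomes = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0}
print("estimated impact:", estimated_impact(outcomes, treatment, control))
```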
In the last 30
years, experimental designs have been used to evaluate a
wide range of social interventions, including housing
allowances, health insurance reforms, the negative income
tax, and employment and training programs. 2 The evaluations of
welfare-to-work programs conducted by Manpower
Demonstration Research Corporation (MDRC) in the
1980s--which used experimental designs--are
widely credited with having shaped the Family Support Act
of 1988. 3 Similarly, in the 1990s, Abt
Associates evaluated the Job Training Partnership Act
(JTPA) program. 4 Its findings, also based on an
experimental design, likewise led to major policy
changes.
Experimental designs
are not without disadvantages, however. They can raise
substantial ethical issues, can be difficult to implement
properly, and cannot be used for certain types of
interventions. 5
Ethical issues
arise, for example, when the treatment group is subjected
to an intervention that may make its members worse off or
when the control group is denied services that may be
beneficial. 6 In the late 1980s, the state of Texas
implemented a random assignment evaluation to test the
impact of 12-month transitional child care and Medicaid
benefits. When the study began, the treatment group was
receiving a benefit (the transitional services) that was
otherwise unavailable. Hence, denying the same benefit to
the control group did not raise an ethical issue. But a
year later, nearly identical transition benefits became
mandatory under the Family Support Act. At that point,
the control group was being denied what had become part
of the national, legally guaranteed benefit package. In
the face of complaints, the secretary of Health and Human
Services required the control group to receive the
benefits, thereby undercutting the experiment.
Sometimes, members
of the program staff object to the denial of services
built into the experimental design. When they view the
experiment as unethical or fear that members of the
control group will complain, they sometimes circumvent
the procedures of the random assignment process, thus
undermining the comparability of the treatment and
control groups. This apparently happened, for example, in
an evaluation of the Special Supplemental Nutrition
Program for Women, Infants, and Children (WIC). 7
Implementation
issues arise in every study. 8 As Rossi and Freeman state,
"the integrity of random assignment is easily
threatened." 9 For example,
"contamination," where control group members
are also subjected to all or some of the intervention,
was a problem in many of the waiver-based state welfare
reform demonstrations. In many states, the new policies
were applicable to all recipients in the state, except
for a small control group (usually drawn from a limited
number of sites). It was not uncommon for members of
these control cases to migrate to other counties and
receive the treatment elsewhere. Metcalf and Thornton
add:
Typical
forms of distortion or "contamination"
include corruption of the random assignment
mechanism, provision of the treatment
intervention to controls despite proper
randomization, other forms of
"compensatory" treatment of controls or
unintended changes in the control group
environment, and distortion of referral flows.10
Statistical
adjustments cannot always deal successfully with such
problems.
In some experiments,
members of either the control or treatment groups may not
fully understand the rules to which they are subject. For
example, in New Jersey's Family Development Program, it
appears that many members of the control group believed
that the family cap policy applied to them, probably
because of the extensive statewide publicity this
provision received.11 Because this may have
affected their behavior, it is unlikely that the impact
of the demonstration can be determined by comparing the
birth rates in the two groups.
In other cases,
treatment group members were not made aware of all the
changes that affected them. For example, in California,
caseworkers did not initially inform recipients of the
state's new work incentive provisions. The impact
findings suggest that the demonstration had little effect
on employment and earnings, but it is unclear whether
this is because the policy was ineffective or because the
recipients were unaware of its provisions.
Attrition from the
research sample and nonresponse are other problems for
experiments, although they can also afflict
quasi-experiments. First, the research sample may become
less representative of the original target population.
For example, in the Georgia Preschool Immunization
Project, the evaluator was only able to obtain permission
to examine the immunization records from about half of
the research sample.12 If the immunization records
for those not included are significantly different from
those for whom data are available, nonresponse bias could
be a problem. Second, if there is differential attrition
between the treatment and control groups, their
comparability is undermined and bias may be introduced.13 This would be especially
problematic if the characteristics of those for whom
immunization records are available differed
systematically in the treatment versus the control group.
For example, it may be that those in the treatment group
who were in compliance were more likely to open their
records to the evaluators--a reasonable assumption,
since those who are behind may hesitate for fear of being
sanctioned. There was some evidence in the experiment of
such differences, based on the characteristics of the
clients at the time of random assignment. As a result,
the evaluator had to adopt statistical techniques to
control for bias.
There is a
possibility that the welfare reform intervention, itself,
might affect attrition. For example, treatment cases that
lose benefits due to a time limit may move to another
state to regain assistance, but there would be no
corresponding incentive for control cases to do the same.
All of the foregoing
implementation problems, of course, apply to
quasi-experimental designs as well.
Some
interventions are not amenable to randomized experiments.
Experimental designs may not be appropriate for
interventions that have significant entry effects. For
example, a stringent work requirement may deter families
from applying for assistance. This effect may not be
captured in a random assignment evaluation, because it
occurs before the family is randomly assigned.14 Some reforms may have
community-wide effects.15 For example, they may change
the culture of the welfare office, leading caseworkers to
treat all clients--treatment and
control--differently. Again, this change would not be
captured by a simple treatment-control comparison of
outcomes.
Furthermore, the
random assignment process, itself, can affect the way the
program works and the benefits or services available to
control group members.16 For example, in the National
JTPA evaluation, extensive outreach was necessary because
the assignment of applicants to the control group left
unfilled slots in the program. The applicants brought
into the program were not the same as those who were
effectively "displaced" when assigned to the
control group. Thus, the impacts on those who were in the
program may not correspond to the impacts on those who
would have been in the program in the absence of the
demonstration.
Quasi-Experimental
Designs
When random
assignment is not possible or appropriate, researchers
often use quasi-experimental designs. In
quasi-experiments, the counterfactual is established by
selecting a "comparison" group whose members
are not subject to the intervention but are nevertheless
thought to be similar to the treatment group.
Participants vs.
Nonparticipants. Participants in the program are
compared to nonparticipants with similar characteristics
on the assumption that both groups are affected by the
same economic and social forces. But even though the two
groups may appear similar, they may differ in
unmeasurable, or difficult to measure, ways. For example,
those who voluntarily enroll in a training program may
have more desire to find a job than those who do not.
Alternatively, those who do not enroll may want to work
immediately or in other ways may be in less need of a
training program. Both are possibilities that should be
considered in interpreting a study's results. Statistical
and other methods are sometimes used to control for such
"selection effects," but success in doing so
has been mixed.
The Urban Institute
used this approach to evaluate the Massachusetts
Employment and Training (ET) Choices program.17 It compiled a longitudinal
database of information on about 17,000 AFDC recipients,
half of whom participated and half of whom did not participate beyond
initial registration and orientation. The nonparticipants
served as a comparison group, selected through a
statistical procedure that matched the comparison group
members to the participant group on several measurable
characteristics, including race, sex, age, and family
composition. Some characteristics, such as motivation,
could not be measured. Although the evaluators attempted
to control for "selection bias," the results
are still subject to uncertainty.
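The statistical matching step can be pictured with a toy sketch. The code below is a generic nearest-neighbor match on a few measured characteristics, offered only as an illustration of the idea; it is not the Urban Institute's actual procedure, and the variable names and numbers are invented. Note that no such match can balance unmeasured traits such as motivation.

```python
import math

def nearest_neighbor_match(participants, nonparticipants, covariates):
    """For each participant, find the most similar nonparticipant
    (matching with replacement) based only on measured covariates."""
    def distance(a, b):
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in covariates))
    return {
        pid: min(nonparticipants, key=lambda nid: distance(person, nonparticipants[nid]))
        for pid, person in participants.items()
    }

# Invented records, with covariates already standardized.
participants = {1: {"age": 0.2, "children": 1.0}, 2: {"age": -1.1, "children": 0.0}}
nonparticipants = {10: {"age": 0.3, "children": 1.0}, 11: {"age": -0.9, "children": 0.0}}
print(nearest_neighbor_match(participants, nonparticipants, ["age", "children"]))
```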
Comparison Sites.
Individuals from other geographic areas are compared to
those in the program. This avoids problems of comparing
those who volunteer for a program to those who do not
(selection effect), but creates other complications. In
particular, statistical adjustments are needed for
economic and demographic differences between the sites
that may influence participant outcomes. This method
works best when similar sites are matched, and when the
treatment and comparison sites are selected randomly. But
if sites can choose whether to receive the treatment or
serve as the comparison, selection bias can be a problem.
Also, sites that initially appear well matched may become
less so, for reasons unrelated to the intervention. (Some
events, such as a plant closing, can be especially
problematic.)
In one of the rare
exceptions to its requirement that waiver-based
demonstrations use an experimental design, HHS allowed Wisconsin to
evaluate its "Work Not Welfare" demonstration
using a comparison site approach. However, Wisconsin
selected as treatment sites the two counties that were
most interested in implementing the demonstration (and
perhaps most likely to succeed). Beyond this important
difference, the two counties also differed
from others in the state on a number of other dimensions
(for example, they had lower unemployment rates), further
complicating the analysis. One review of the evaluation
plan concludes:
It is
unlikely, however, that matched comparison
counties and statistical models will adequately
control for the fact that the demonstration
counties were preselected. It may not be possible
to separate the effects of the program from the
effects of being in a county where program staff
and administrators were highly motivated to put
clients to work.18
Pre-Post
Comparisons. Cohorts of similar individuals from
different time periods are compared, one representing the
"pre" period and one the "post"
period. This also requires statistically controlling for
differences between the groups. Using this approach,
several studies examined the impact of the AFDC reforms
in the 1981 Omnibus Budget Reconciliation Act (OBRA).19 One problem with pre-post
evaluations is that external factors, such as changing
economic conditions, may affect the variable of interest,
so that the trend established before the new intervention
is not as good a predictor of what would have otherwise
happened. The evaluation of the 1981 OBRA changes, for
instance, had to control for the 1981-1982 recession,
which was the worst in 45 years. In fact, there are
likely to be many changes and it is difficult to
disentangle the impact of a reform initiative from
changes that may occur in the economy or in other public
policies. For example, studies using this methodology to
examine welfare reform in the 1990s would have to control
for expansion in the Earned Income Tax Credit (EITC) and
the increase in the minimum wage, two important policy
changes that could affect labor market outcomes.
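A stylized numeric example (invented numbers, not the OBRA results) shows why such controls matter: the raw difference between the pre and post cohorts mixes the reform's effect with the recession, while a regression that includes the unemployment rate nets some of that out.

```python
import numpy as np

# Rows: one per cohort-year.  Columns: intercept, post-reform indicator,
# local unemployment rate.  All values are invented for illustration.
X = np.array([
    [1.0, 0.0, 7.1],
    [1.0, 0.0, 7.6],
    [1.0, 1.0, 9.5],   # post period coincides with a recession
    [1.0, 1.0, 9.9],
])
y = np.array([52.0, 51.0, 47.0, 46.0])   # e.g., percent of the cohort employed

raw_difference = y[X[:, 1] == 1].mean() - y[X[:, 1] == 0].mean()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares

print("raw pre-post difference:        ", round(raw_difference, 2))
print("difference net of unemployment: ", round(float(coef[1]), 2))
```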
Since there may be
no more than a few years of data on the "pre"
period, the length of follow-up for the "post"
period is limited as well. This may be too short a time
to test the long-term impact of important policy changes,
especially since some changes, such as time limits, may
not take full effect for many years. It also
may not be possible to obtain data on some outcomes for
the "pre" period, such as measures related to
child well-being, particularly if they are not readily
available on administrative records. In addition,
detailed data on participant characteristics, economic
conditions, and other relevant "control"
variables are needed. For example, New Jersey's first
welfare reform demonstration, the Realizing Economic
Achievement (REACH) program, compared the outcomes of a
cohort of recipients subject to REACH to a cohort of
similar individuals from an earlier period.
Unfortunately, the evaluator concluded that the
limitations of the historical data were so severe that it
was not possible to draw any definitive conclusions from
the results.20
Another way of
conducting a pre-post comparison is to examine those
participating in the program before and after going
through it. The outcomes for the group in the pre-program
period serve as the comparison "group" for the
same population after the program is implemented. (For
example, the employment and earnings of individuals can
be compared before and after participation in a training
program.)
A major advantage of
this design is that it requires data only on program
participants. Unfortunately, as Rossi and Freeman note:
Although few
designs have as much intuitive appeal as simple
before-and-after studies, they are among the
least valid assessment approaches. The essential
feature of this approach is a comparison of the
same targets at two points in time, separated by
a period of participation in a program. The
differences between the two measurements are
taken as an estimate of the net effects of the
intervention. The main deficiency of such designs
is that ordinarily they cannot disentangle the
effects of extraneous factors from the effects of
the intervention. Consequently, estimates of the
intervention's net effects are dubious at best.21
Comparisons with
Secondary Data Sets. Secondary data sets, such as
the Census Bureau's Survey of Income and Program
Participation (SIPP), the Current Population Survey
(CPS), or other national or state-level data sources,
have also been used to develop comparison groups. In such
comparisons, a sample of similar persons is identified to
represent what would have happened in the absence of the
intervention. Many evaluations of the Comprehensive
Employment and Training Act (CETA) employed this
approach.22 As with other quasi-experimental
methods, selection bias is a problem, because volunteers
for the program are compared to nonparticipants.
Moreover, complications can arise because the data for
comparison group members derived from such secondary
sources are generally cruder than for the treatment
group.
Time
Series/Cross-Sectional Studies. Time series and
cross-sectional analyses use aggregate data to compare
outcomes either over time or across states (or other
political subdivisions), attempting to control
statistically for variables that can affect the outcome of
interest and including a variable that represents the
intervention itself. These methods have been commonly used by
researchers, but are very sensitive to the specification
of the model.23
For example, one
evaluation of the Massachusetts ET program used time
series analysis.24 A host of explanatory variables were
used to reflect the importance of demographic, economic,
and policy factors that would be expected to have an
impact on the caseload, including a variable to measure
the impact of the program being evaluated. The study
found that the ET program did not lead to any significant
reduction in the welfare rolls in Massachusetts, but the
author cautioned:
Analysis of
time series data is often complicated by the fact
that many variables tend to change over time in
similar ways. For this reason, it may be
difficult to separate out accurately the impact
of the different factors. Thus, the estimated
effects of the explanatory variables may be
unstable, changing from one specification of the
model to others.25
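The instability described in this caution can be reproduced with a small simulation. In the invented data below, the true program effect is zero and the caseload falls only because unemployment falls after the program begins; the estimated "program effect" then depends heavily on whether unemployment is included in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(24)
program = (years >= 12).astype(float)        # program begins in year 12
# Unemployment is flat before the program and declines afterward.
unemployment = np.where(program == 0, 7.5, 7.5 - 0.3 * (years - 11)) + rng.normal(0, 0.1, 24)
# True model: caseload depends on unemployment only (zero program effect).
caseload = 100 + 4.0 * unemployment + rng.normal(0, 0.5, 24)

def program_effect(include_unemployment):
    """Coefficient on the program indicator under a given specification."""
    cols = [np.ones(24), program] + ([unemployment] if include_unemployment else [])
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), caseload, rcond=None)
    return round(float(coef[1]), 2)

print("program effect, unemployment included:", program_effect(True))    # near the true zero
print("program effect, unemployment omitted: ", program_effect(False))   # decline misattributed to program
```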
As is evident from
the above discussion, a major problem with
quasi-experimental designs is selection bias. This arises
out of processes that influence whether persons are or
are not program participants. Unmeasured differences in
personal characteristics, such as the degree of
motivation, rather than the program itself, could explain
differential outcomes. Sometimes the selection processes
are system characteristics, such as differences among
welfare offices, which lead some to participate in reform
efforts and others not to. Although there are a variety
of statistical techniques to correct for selection bias,
it is impossible to know with certainty which is most
appropriate. And, since these methods result in different
estimates, there is always some uncertainty regarding the
findings of quasi-experiments. Here is how Gary Burtless
of the Brookings Institution put it:
Our
uncertainty about the presence, direction, and
potential size of selection bias makes it
difficult for social scientists to agree on the
reliability of estimates drawn from
nonexperimental studies. The estimates may be
suggestive, and they may even be helpful when
estimates from many competing studies all point
in the same direction. But if statisticians
obtain widely differing estimates or if the
available estimates are the subject of strong
methodological criticism, policymakers will be
left uncertain about the effectiveness of the
program.26
With experimental
designs, such adjustments are unnecessary, since random
assignment should equalize the treatment and control
groups in terms of both observable and unobservable
characteristics.
***
Experimental designs
have long been the evaluation method of choice, and
should probably be considered first in any evaluation.
Many current welfare reform efforts, however, are not
amenable to randomized experiments. The new program or
policy may cover the entire state, without provision
having been made for a control group; the changes made by
the state may have affected norms and expectations across
the entire community, sample, or agency, so that the
control group's behavior was also influenced; and there
may be substantial "entry effects," as
described above.
Thus, in many
circumstances, a quasi-experimental design will be the
preferable approach. Although not nearly as problem-free
as experimental designs, quasi-experiments can provide important
information about new policies and programs.
The overriding point
is that welfare reform efforts should be evaluated as
well as possible, and the design chosen should be the one
most likely to succeed.
© 1997 by the University of Maryland,
College Park, Maryland. All rights reserved. No part
of this publication may be used or reproduced in any manner
whatsoever without permission in writing from the University of
Maryland except in cases of brief quotations embodied in news
articles, critical articles, or reviews. The views
expressed in the publications of the University of Maryland are
those of the authors and do not necessarily reflect the views of
the staff, advisory panels, officers, or trustees of the
University of Maryland.