estimates (Bakeman & Gottman, 1986; Jacob, Tennenbaum, & Krahn, 1987). Users of observation systems increasingly advocate the kappa statistic over percent agreement because kappa corrects for chance agreement (Hops, Davis, & Longoria, 1995; Suen & Ary, 1989). Kappa is defined as the ratio of actual nonchance agreements to the total possible nonchance agreements (Suen & Ary, 1989). The range of possible kappa values extends from -1.00 to 1.00. Values at or near zero indicate chance levels of agreement, and negative values indicate agreement below chance. Kappa values above .75 are considered excellent, values from .60 to .75 are considered good, and values from .40 to .60 are considered fair (Fleiss, 1981).

Considered the most comprehensive estimate of reliability, the intraclass correlation coefficient method uses the procedures of a two-way analysis of variance (ANOVA) and incorporates tests of both interobserver and intraobserver reliability. Factors are tested for their ability to explain variance in the dependent variables of interest. When behavior is observed across several observers and subjects, the variance in the behavioral scores can be partitioned into differences among observers (an unwanted source of variance), differences among subjects (true score variance), and random error. Overall, intraclass correlations have been described in positive terms and are seen as broadening the scope of analysis for reliability studies (Hartmann & Wood, 1990). Symbolic forms of kappa and the intraclass correlation are sketched below.

Validity

In addition to establishing consistency in the coding of observational data, an assessment device must be shown to measure what it purports to measure. Hops et al. (1995) discuss the concept of validity as it applies to direct observation. Estimates of
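As a point of reference, the verbal definition of kappa given above corresponds to the conventional formula, in which $P_o$ is the observed proportion of agreement and $P_e$ is the proportion of agreement expected by chance:

\[
\kappa = \frac{P_o - P_e}{1 - P_e}
\]

For example, with hypothetical values of $P_o = .85$ and $P_e = .60$, kappa would be $(.85 - .60)/(1 - .60) = .625$, which falls in the "good" range described by Fleiss (1981).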
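The intraclass correlation can be sketched in a similar way. Assuming the common two-way random-effects formulation (one of several possible variants, consistent with the decomposition described above), the coefficient is the proportion of total variance attributable to true differences among subjects:

\[
\mathrm{ICC} = \frac{\sigma^2_{\text{subjects}}}{\sigma^2_{\text{subjects}} + \sigma^2_{\text{observers}} + \sigma^2_{\text{error}}}
\]

The variance components are estimated from the two-way ANOVA mean squares, and the coefficient approaches 1.00 as observer variance and random error shrink relative to true subject variance.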