
Sources of test bias. Explain a source of test bias that can threaten the validity of test results. Include a peer-reviewed article that discusses test bias. What steps can be taken to reduce the risk of bias?


CHAPTER 6

Ability Testing: Group Tests and Controversies

TOPIC 6A Group Tests of Ability and Related Concepts

Nature, Promise, and Pitfalls of Group Tests
Group Tests of Ability
Multiple Aptitude Test Batteries
Predicting College Performance
Postgraduate Selection Tests
Educational Achievement Tests

The practical success of early intelligence scales such as the 1905 Binet-Simon test motivated psychologists and educators to develop instruments that could be administered simultaneously to large numbers of examinees. Test developers were quick to realize that group tests allowed for the efficient evaluation of dozens or hundreds of examinees at the same time. As reviewed in an earlier chapter, one of the first uses of group tests was for screening and assignment of military personnel during World War I. The need to quickly test thousands of Army recruits inspired psychologists in the United States, led by Robert M. Yerkes, to make rapid advances in psychometrics and test development (Yerkes, 1921). Many new applications followed immediately—in education, industry, and other fields. In Topic 6A, Group Tests of Ability and Related Concepts, we introduce the reader to the varied applications of group tests and also review a sampling of typical instruments. In addition, we explore a key question raised by the consequential nature of these tests—can examinees boost their scores significantly by taking targeted test preparation courses? This is but one of many unexpected issues raised by the widespread use of group tests. In Topic 6B, Test Bias and Other Controversies, we continue a reflective theme by looking into test bias and other contentious issues in testing.

NATURE, PROMISE, AND PITFALLS OF GROUP TESTS

Group tests serve many purposes, but the vast majority can be assigned to one of three types: ability, aptitude, or achievement tests. In the real world, the distinction among these kinds of tests often is quite fuzzy (Gregory, 1994a). These instruments differ mainly in their functions and applications, less so in actual test content. In brief, ability tests typically sample a broad assortment of proficiencies in order to estimate current intellectual level. This information might be used for screening or placement purposes, for example, to determine the need for individual testing or to establish eligibility for a gifted and talented program. In contrast, aptitude tests usually measure a few homogeneous segments of ability and are designed to predict future performance. Predictive validity is foundational to aptitude tests, and often they are used for institutional selection purposes. Finally, achievement tests assess current skill attainment in relation to the goals of school and training programs. They are designed to mirror educational objectives in reading, writing, math, and other subject areas. Although often used to identify educational attainment of students, they also function to evaluate the adequacy of school educational programs.

Whatever their application, group tests differ from individual tests in five ways:

• Multiple-choice versus open-ended format
• Objective machine scoring versus examiner scoring
• Group versus individualized administration
• Applications in screening versus remedial planning
• Huge versus merely large standardization samples

These differences allow for great speed and cost efficiency in group testing, but a price is paid for these advantages.

Although the early psychometric pioneers embraced group testing wholeheartedly, they recognized fully the nature of their Faustian bargain: Psychologists had traded the soul of the individual examinee in return for the benefits of mass testing. Whipple (1910) summed up the advantages of group testing but also pointed to the potential perils:

Most mental tests may be administered either to individuals or to groups. Both methods have advantages and disadvantages. The group method has, of course, the particular merit of economy of time; a class of 50 or 100 children may take a test in less than a fiftieth or a hundredth of the time needed to administer the same test individually. Again, in certain comparative studies, e.g., of the effects of a week's vacation upon the mental efficiency of school children, it becomes imperative that all S's should take the tests at the same time. On the other hand, there are almost sure to be some S's in every group that, for one reason or another, fail to follow instructions or to execute the test to the best of their ability. The individual method allows E to detect these cases, and in general, by the exercise of personal supervision, to gain, as noted above, valuable information concerning S's attitude toward the test.

In sum, group testing poses two interrelated risks: (1) some examinees will score far below their true ability, owing to motivational problems or difficulty following directions, and (2) invalid scores will not be recognized as such, with undesirable consequences for these atypical examinees. There is really no simple way to entirely avoid these risks, which are part of the trade-off for the efficiency of group testing. However, it is possible to minimize the potentially negative consequences if examiners scrutinize very low scores with skepticism and recommend individual testing for these cases.

We turn now to an analysis of group tests in a variety of settings, including cognitive tests for schools and clinics, placement tests for career and military evaluation, and aptitude tests for college and postgraduate selection.

GROUP TESTS OF ABILITY

Multidimensional Aptitude Battery-II (MAB-II)

The Multidimensional Aptitude Battery-II (MAB-II; Jackson, 1998) is a recent group intelligence test designed to be a paper-and-pencil equivalent of the WAIS-R. As the reader will recall, the WAIS-R is a highly respected instrument (now replaced by the WAIS-III), in its time the most widely used of the available adult intelligence tests. Kaufman (1983) noted that the WAIS-R was "the criterion of adult intelligence, and no other instrument even comes close." However, a highly trained professional needs about 1½ hours just to administer the Wechsler adult test to a single person. Because professional time is at a premium, a complete Wechsler intelligence assessment—including administration, scoring, and report writing—easily can cost hundreds of dollars. Many examiners have long suspected that an appropriate group test, with the attendant advantages of objective scoring and computerized narrative report, could provide an equally valid and much less expensive alternative to individual testing for most persons.

The MAB-II was designed to produce subtests and factors parallel to the WAIS-R but employing a multiple-choice format capable of being computer scored. The apparent goal in designing this test was to produce an instrument that could be administered to dozens or hundreds of persons by one examiner (and perhaps a few proctors) with minimal training. In addition, the MAB-II was designed to yield IQ scores with psychometric properties similar to those found on the WAIS-R. Appropriate for examinees from ages 16 to 74, the MAB-II yields 10 subtest scores, as well as Verbal, Performance, and Full Scale IQs.

Although it consists of original test items, the MAB-II is mainly a sophisticated subtest-by-subtest clone of the WAIS-R. The 10 subtests are listed as follows:

Verbal: Information, Comprehension, Arithmetic, Similarities, Vocabulary
Performance: Digit Symbol, Picture Completion, Spatial, Picture Arrangement, Object Assembly

The reader will notice that Digit Span from the WAIS-R is not included on the MAB-II. The reason for this omission is largely practical: There would be no simple way to present a Digit-Span-like subtest in paper-and-pencil format. In any case, the omission is not serious. Digit Span has the lowest correlation with overall WAIS-R IQ, and it is widely recognized that this subtest makes a minimal contribution to the measurement of general intelligence.

The only significant deviation from the WAIS-R is the replacement of Block Design with a Spatial subtest on the MAB-II. In the Spatial subtest, examinees must mentally perform spatial rotations of figures and select one of five possible rotations presented as their answer (Figure 6.1). Only mental rotations are involved (although "flipped-over" versions of the original stimulus are included as distractor items). The advanced items are very complex and demanding.

The items within each of the 10 MAB-II subtests are arranged in order of increasing difficulty, beginning with questions and problems that most adolescents and adults find quite simple and proceeding upward to items that are so difficult that very few persons get them correct. There is no penalty for guessing, and examinees are encouraged to respond to every item within the time limit. Unlike the WAIS-R, in which the verbal subtests are untimed power measures, every MAB-II subtest incorporates elements of both power and speed: Examinees are allowed only seven minutes to work on each subtest. Including instructions, the Verbal and Performance portions of the MAB-II each take about 50 minutes to administer.

The MAB-II is a relatively minor revision of the MAB, and the technical features of the two versions are nearly identical. A great deal of psychometric information is available for the original version, which we report here. With regard to reliability, the results are generally quite impressive. For example, in one study of over 500 adolescents ranging in age from 16 to 20, the internal consistency reliability of Verbal, Performance, and Full Scale IQs was in the high .90s. Test–retest data for this instrument also excel. In a study of 52 young psychiatric patients, the individual subtests showed reliabilities that ranged from .83 to .97 (median of .90) for the Verbal scale and from .87 to .94 (median of .91) for the Performance scale (Jackson, 1984). These results compare quite favorably with the psychometric standards reported for the WAIS-R.

Factor analyses of the MAB-II are broadly supportive of the construct validity of this instrument and its predecessor (Lee, Wallbrown, & Blaha, 1990). Most recently, Gignac (2006) examined the factor structure of the MAB-II using a series of confirmatory factor analyses with data on 3,121 individuals reported in Jackson (1998). The best fit to the data was provided by a nested model consisting of a first-order general factor, a first-order Verbal Intelligence factor, and a first-order Performance Intelligence factor. The one caveat of this study was that Arithmetic did not load specifically on the Verbal Intelligence factor independent of its contribution to the general factor.

FIGURE 6.1 Demonstration Items from Three Performance Tests of the Multidimensional Aptitude Battery-II (MAB)
Source: Reprinted with permission from Jackson, D. N. (1984a). Manual for the Multidimensional Aptitude Battery. Port Huron, MI: Sigma Assessment Systems, Inc. (800) 265-1285.

Other researchers have noted the strong congruence between factor analyses of the WAIS-R (with Digit Span removed) and the MAB. Typically, separate Verbal and Performance factors emerge for both tests (Wallbrown, Carmin, & Barnett, 1988). In a large sample of inmates, Ahrens, Evans, and Barnett (1990) observed validity-confirming changes in MAB scores in relation to education level. In general, with the possible exception that Arithmetic does not contribute reliably to the Verbal factor, there is good justification for the use of separate Verbal and Performance scales on this test.

In general, the validity of this test rests upon its very strong physical and empirical resemblance to its parent test, the WAIS-R. Correlational data between MAB and WAIS-R scores are crucial in this regard. For 145 persons administered the MAB and WAIS-R in counterbalanced fashion, correlations between subtests ranged from .44 (Spatial/Block Design) to .89 (Arithmetic and Vocabulary), with a median of .78. WAIS-R and MAB IQ correlations were very healthy, namely, .92 for Verbal IQ, .79 for Performance IQ, and .91 for Full Scale IQ (Jackson, 1984a). With only a few exceptions, correlations between MAB and WAIS-R scores exceed those between the WAIS and the WAIS-R. Carless (2000) reported a similar, strong overlap between MAB scores and WAIS-R scores in a study of 85 adults for the Verbal, Performance, and Full Scale IQ scores. However, she found that 4 of the 10 MAB subtests did not correlate with the WAIS-R subscales they were designed to represent, suggesting caution in using this instrument to obtain detailed information about specific abilities.

Chappelle et al. (2010) obtained MAB-II scores for military personnel in an elite training program for AC-130 gunship operators. The officers who passed training (N = 59) and those who failed training (N = 20) scored above average (mean Full Scale IQs of 112.5 and 113.6, respectively), but there were no significant differences between the two groups on any of the test indices. This is a curious result insofar as IQ typically demonstrates at least mild predictive potential for real-world vocational outcomes. Further research on the MAB-II as a predictor of real-world results would be desirable.

The MAB-II shows great promise in research, career counseling, and personnel selection. In addition, this test could function as a screening instrument in clinical settings, as long as the examiner views low scores as a basis for follow-up testing with an individual intelligence test. Examiners must keep in mind that the MAB-II is a group test and, therefore, carries with it the potential for misuse in individual cases. The MAB-II should not be used in isolation for diagnostic decisions or for placement into programs such as classes for intellectually gifted persons.

A Multilevel Battery: The Cognitive Abilities Test (CogAT)

One important function of psychological testing is to assess students' abilities that are prerequisite to traditional classroom-based learning. In designing tests for this purpose, the psychometrician must contend with the obvious and nettlesome problem that school-aged children differ hugely in their intellectual abilities. For example, a test appropriate for a sixth grader will be much too easy for a tenth grader, yet impossibly difficult for a third grader.

The answer to this dilemma is a multilevel battery, a series of overlapping tests. In a multilevel battery, each group test is designed for a specific age or grade level, but adjacent tests possess some common content. Because of the overlapping content with adjacent age or grade levels, each test possesses a suitably low floor and high ceiling for proper assessment of students at both extremes of ability. Virtually every school system in the United States uses at least one nationally normed multilevel battery.

The Cognitive Abilities Test (CogAT) is one of the best school-based test batteries in current use (Lohman & Hagen, 2001). A recent revision of the test is the CogAT Multilevel Edition, Form 6, released in 2001. Norms for 2005 also are available. We discuss this instrument in some detail.

The CogAT evolved from the Lorge-Thorndike Intelligence Tests, one of the first group tests of intelligence intended for widespread use within school systems. The CogAT is primarily a measure of scholastic ability but also incorporates a nonverbal reasoning battery with items that bear no direct relation to formal school instruction. The two primary batteries, suitable for students in kindergarten through third grade, are briefly discussed at the end of this section. Here we review the multilevel edition intended for students in 3rd through 12th grade.

The nine subtests of the multilevel CogAT are grouped into three areas: verbal, quantitative, and nonverbal, each including three subtests. Representative items for the subtests of the CogAT are depicted in Figure 6.2. The tests on the Verbal Battery evaluate verbal skills and reasoning strategies (inductive and deductive) needed for effective reading and writing. The tests on the Quantitative Battery appraise quantitative skills important for mathematics and other disciplines. The Nonverbal Battery can be used to estimate the cognitive level of students with limited reading skill, poor English proficiency, or inadequate educational exposure.

For each CogAT subtest, items are ordered by difficulty level in a single test booklet. However, entry and exit points differ for each of eight overlapping levels (A through H). In this manner, grade-appropriate items are provided for all examinees.

The subtests are strictly timed, with limits that vary from 8 to 12 minutes. Each of the three batteries can be administered in less than an hour. However, the manual recommends three successive testing days for younger children. For older children, two batteries should be administered the first day, with a single testing period the next.

Verbal Battery

1. Verbal Classification

Circle the item below that belongs with these three:

milk butter cheese

A. eggs B. yogurt C. grocery

D. bacon E. recipe

2. Sentence Completion

Circle the word below that best completes this sentence:

Fish _____________ in the ocean.

A. sit B. next C. fly

D. swim E. climb

3. Verbal Analogies

Circle the word that best fits this analogy:

Right → Left : Top → ______

A. Side B. Out C. Wrong

D. On E. Bottom

Quantitative Battery

4. Quantitative Relations

Circle the choice that depicts the relationship between

I and II:

I. 6/2 + 1

II. 9/3 − 1

A. I is greater than II B. I is equal to II

C. I is less than II

5. Number Series

Circle the number below that comes next in this series:

1 11 6 16 11 21 16

A. 31 B. 16 C. 26 D. 6 E. 11

6. Equation Building

Circle the choice below that could be derived from these:

1 2 4 + −

A. −1 B. 7 C. 0 D. 1 E. −3

FIGURE 6.2 Subtests and Representative Items of the Cognitive Abilities Test, Form 6


Nonverbal Battery

7. Figure Classification

Circle the item below that belongs with these three figures:

A B C D E

8. Figure Analogies

Circle the figure below that best fits with this analogy:


A B C D E

9. Figure Analysis

Circle the choice below that fits this paper folding and hole punching:

A B C D E

Note: These items resemble those on the CogAT 6. Correct answers: 1: B. yogurt (the only dairy product). 2: D. swim (fish swim in the ocean). 3: E. bottom (the opposite of top). 4: A. I is greater than II (4 is greater than 2). 5: C. 26 (the algorithm is add 10, subtract 5, add 10 . . .). 6: A. −1 (the only answer that fits). 7: A (four-sided shape that is filled in). 8: D (same shape, bigger to smaller). 9: E (correct answer).

FIGURE 6.2 continued
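The arithmetic behind the quantitative items in Figure 6.2 can be checked directly. Below is a minimal sketch in Python (the function names and framing are our own illustration, not CogAT materials), verifying item 4 (6/2 + 1 = 4 versus 9/3 − 1 = 2) and the alternating add-10/subtract-5 rule of item 5:

```python
def quantitative_relation(i: float, ii: float) -> str:
    """Item 4: compare quantity I with quantity II."""
    if i > ii:
        return "A. I is greater than II"
    return "B. I is equal to II" if i == ii else "C. I is less than II"

def next_in_series(series):
    """Item 5: the rule alternates between adding 10 and subtracting 5."""
    last_step = series[-1] - series[-2]
    # If the last step subtracted 5, the next step adds 10, and vice versa.
    return series[-1] + (10 if last_step == -5 else -5)

print(quantitative_relation(6 / 2 + 1, 9 / 3 - 1))  # A. I is greater than II
print(next_in_series([1, 11, 6, 16, 11, 21, 16]))   # 26
```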

Raw scores for each battery can be transformed into an age-based normalized standard score with a mean of 100 and a standard deviation of 15. In addition, percentile ranks and stanines for age groups and grade level are also available. Interpolation was used to determine fall, winter, and spring grade-level norms.
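The conversion from raw score to this reporting metric is a normalized standard-score transformation. The sketch below is illustrative only, assuming a normal distribution and a made-up norm-group mean and SD (actual CogAT norms are empirical tables, not this formula):

```python
from statistics import NormalDist

def standard_score(raw: float, norm_mean: float, norm_sd: float) -> int:
    """Linear z-score conversion to the IQ-style metric (mean 100, SD 15)."""
    z = (raw - norm_mean) / norm_sd
    return round(100 + 15 * z)

def percentile_rank(score: float) -> int:
    """Percentile rank of a standard score, assuming normality."""
    return round(100 * NormalDist(mu=100, sigma=15).cdf(score))

def stanine(score: float) -> int:
    """Stanine (1-9): half-SD-wide bands centered on the mean (approximation)."""
    z = (score - 100) / 15
    return max(1, min(9, round(2 * z + 5)))

# Hypothetical case: raw score 52 in an age group with raw mean 40, raw SD 8
ss = standard_score(52, norm_mean=40, norm_sd=8)
print(ss, percentile_rank(ss), stanine(ss))  # 122 93 8
```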

The CogAT was co-normed (standardized concurrently) with two achievement tests, the Iowa Tests of Basic Skills and the Iowa Tests of Educational Development. Concurrent standardization with achievement measures is a common and desirable practice in the norming of multilevel intelligence tests. The particular virtue of joint norming is that the expected correspondence between intelligence and achievement scores is determined with great precision. As a consequence, examiners can more accurately identify underachieving students in need of remediation or further assessment for potential learning disability.

The reliability of the CogAT is exceptionally good. In previous editions, the Kuder-Richardson-20 reliability estimates for the multilevel batteries averaged .94 (Verbal), .92 (Quantitative), and .93 (Nonverbal) across all grade levels. The six-month test–retest reliabilities for alternate forms ranged from .85 to .93 (Verbal), .78 to .88 (Quantitative), and .81 to .89 (Nonverbal).

The manual provides a wealth of information on content, criterion-related, and construct validity of the CogAT; we summarize only the most pertinent points here. Correlations between the CogAT and achievement batteries are substantial. For example, the CogAT Verbal Battery correlates in the .70s to .80s with achievement subtests from the Iowa Tests of Basic Skills.

The CogAT batteries predict school grades reasonably well. Correlations range from the .30s to the .60s, depending on grade level, sex, and ethnic group. There does not appear to be a clear trend as to which battery is best at predicting grade point average. Correlations between the CogAT and individual intelligence tests are also substantial, typically ranging from .65 to .75. These findings speak well for the construct validity of the CogAT insofar as the Stanford-Binet is widely recognized as an excellent measure of individual intelligence.

Ansorge (1985) has questioned whether all three batteries are really necessary. He points out that correlations among the Verbal, Quantitative, and Nonverbal batteries are substantial. The median values across all grades are as follows:

Verbal and Quantitative: .78
Nonverbal and Quantitative: .78
Verbal and Nonverbal: .72

Since the Quantitative Battery offers little uniqueness, from a purely psychometric point of view there is no justification for including it. Nonetheless, the test authors recommend use of all batteries in hopes that differences in performance will assist teachers in remedial planning. However, the test authors do not make a strong case for doing this.

A study by Stone (1994) provides a notable justification for using the CogAT as a basis for student evaluation. He found that CogAT scores for 403 third graders provided an unbiased prediction of student achievement that was more accurate than teacher ratings. In particular, teacher ratings showed bias against Caucasian and Asian American students by underpredicting their achievement scores.

Raven's Progressive Matrices (RPM)

First introduced in 1938, Raven's Progressive Matrices (RPM) is a nonverbal test of inductive reasoning based on figural stimuli (Raven, Court, & Raven, 1986, 1992). This test has been very popular in basic research and is also used in some institutional settings for purposes of intellectual screening.

RPM was originally designed as a measure of Spearman's g factor (Raven, 1938). For this reason, Raven chose a special format for the test that presumably required the exercise of g. The reader is reminded that Spearman defined g as the "eduction of correlates." The term eduction refers to the process of figuring out relationships based on the perceived fundamental similarities between stimuli. In particular, to correctly answer items on the RPM, examinees must identify a recurring pattern or relationship between figural stimuli organized in a 3 × 3 matrix. The items are arranged in order of increasing difficulty, hence the reference to progressive matrices.

Raven's test is actually a series of three different instruments. Much of the confusion about validity, factorial structure, and the like stems from the unexamined assumption that all three forms should produce equivalent findings. The reader is encouraged to abandon this unwarranted hypothesis. Even though the three forms of the RPM resemble one another, there may be subtle differences in the problem-solving strategies required by each.

The Coloured Progressive Matrices is a 36-item test designed for children from 5 to 11 years of age. Raven incorporated colors into this version of the test to help hold the attention of the young children. The Standard Progressive Matrices is normed for examinees from 6 years and up, although most of the items are so difficult that the test is best suited for adults. This test consists of 60 items grouped into 5 sets of 12 progressions. The Advanced Progressive Matrices is similar to the Standard version but has a higher ceiling. The Advanced version consists of 12 problems in Set I and 36 problems in Set II. This form is especially suitable for persons of superior intellect.

Large-sample U.S. norms for the Coloured and Standard Progressive Matrices are reported in Raven and Summers (1986). Separate norms for Mexican American and African American children are included. Although there was no attempt to use a stratified random-sampling procedure, the selection of school districts was so widely varied that the American norms for children appear to be reasonably sound. Sattler (1988) summarizes the relevant norms for all versions of the RPM. Raven, Court, and Raven (1992) produced new norms for the Standard Progressive Matrices, but Gudjonsson (1995) has raised a concern that these data are compromised because the testing was not monitored.

For the Coloured Progressive Matrices, split-half reliabilities in the range of .65 to .94 are reported, with younger children producing lower values (Raven, Court, & Raven, 1986). For the Standard Progressive Matrices, a typical split-half reliability is .86, although lower values are found with younger subjects (Raven, Court, & Raven, 1983). Test–retest reliabilities for all three forms vary considerably from one sample to the next (Raven, 1965; Raven et al., 1986). For normal adults in their late teens or older, reliability coefficients of .80 to .93 are typical. However, for preteen children, reliability coefficients as low as .71 are reported. Thus, for younger subjects, RPM may not possess sufficient reliability to warrant its use for individual decision making.
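Split-half coefficients such as these are typically computed by correlating odd-item and even-item half-scores, then stepping the result up to full-test length with the Spearman-Brown formula, r_full = 2r_half / (1 + r_half). A minimal sketch with fabricated data (not actual RPM responses):

```python
from statistics import correlation  # requires Python 3.10+

def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to estimated full-length reliability."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores) -> float:
    """item_scores: one list of 0/1 item results per examinee."""
    odd_halves = [sum(items[0::2]) for items in item_scores]
    even_halves = [sum(items[1::2]) for items in item_scores]
    return spearman_brown(correlation(odd_halves, even_halves))

# Fabricated responses for five examinees on a six-item test
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
print(round(split_half_reliability(data), 2))  # 0.92
```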

Factor-analytic studies of the RPM provide little, if any, support for the original intention of the test to measure a unitary construct (Spearman's g factor). Studies of the Coloured Progressive Matrices reveal three orthogonal factors (e.g., Carlson & Jensen, 1980). Factor I consists largely of very difficult items and might be termed closure and abstract reasoning by analogy. Factor II is labeled pattern completion through identity and closure. Factor III consists of the easiest items and is defined as simple pattern completion (Carlson & Jensen, 1980). In sum, the very easy and the very hard items on the Coloured Progressive Matrices appear to tap different intellectual processes.

The Advanced Progressive Matrices breaks down into two factors that may have separate predictive validities (Dillon, Pohlmann, & Lohman, 1981). The first factor is composed of items in which the solution is obtained by adding or subtracting patterns (Figure 6.3a). Individuals performing well on these items may excel in rapid decision making and in situations where part–whole relationships must be perceived. The second factor is composed of items in which the solution is based on the ability to perceive the progression of a pattern (Figure 6.3b). Persons who perform well on these items may possess good mechanical ability as well as good skills for estimating projected movement and performing mental rotations. However, the skills represented by each factor are conjectural at this point and in need of independent confirmation.

FIGURE 6.3 Raven's Progressive Matrices: Typical Items. [(a) a matrix problem solved by adding or subtracting patterns and (b) a matrix problem solved by perceiving the progression of a pattern, each with eight numbered response alternatives]

A huge body of published research bears on the validity of the RPM. The early data are well summarized by Burke (1958), while later findings are compiled in the current RPM manuals (Raven & Summers, 1986; Raven, Court, & Raven, 1983, 1986, 1992). In general, validity coefficients with achievement tests range from the .30s to the .60s. As might be expected, these values are somewhat lower than found with more traditional (verbally loaded) intelligence tests. Validity coefficients with other intelligence tests range from the .50s to the .80s.

Also, as might be expected, the correlations tend to be higher with performance than with verbal tests. In a massive study involving thousands of schoolchildren, Saccuzzo and Johnson (1995) concluded that the Standard Progressive Matrices and the WISC-R showed approximately equal predictive validity and no evidence of differential validity across eight different ethnic groups. In a lengthy review, Raven (2000) discusses stability and variation in the norms for the Raven's Progressive Matrices across cultural, ethnic, and socioeconomic groups over the last 60 years. Indicative of the continuing interest in this venerable instrument, Costenbader and Ngari (2001) describe the standardization of the Coloured Progressive Matrices in Kenya. Further indicating the huge international popularity of the test, Khaleefa and Lynn (2008) provide standardization data for 6- to 11-year-old children in Yemen.

Even though the RPM has not lived up to its original intention of measuring Spearman's g factor, the test is nonetheless a useful index of nonverbal, figural reasoning. The recent updating of norms was a much-welcomed development for this well-known test, in that many American users were leery of the outdated and limited British norms. Nonetheless, adult norms for the Standard and Advanced Progressive Matrices are still quite limited.

The RPM is particularly valuable for the supplemental testing of children and adults with hearing, language, or physical disabilities. Often these examinees are difficult to assess with traditional measures that require auditory attention, verbal expression, or physical manipulation. In contrast, the RPM can be explained through pantomime, if necessary. Moreover, the only output required of the examinee is a pencil mark or gesture denoting the chosen alternative. For these reasons, the RPM is ideally suited for testing persons with limited command of the English language. In fact, the RPM is about as culturally reduced as possible: The test protocol does not contain a single word in any language. Mills and Tissot (1995) found that the Advanced Progressive Matrices identified a higher proportion of minority children as gifted than did a more traditional measure of academic aptitude (the School and College Ability Test).

Bilker, Hansen, Brensinger, and others (2012) developed a psychometrically sound 9-item version of the 60-item Standard Progressive Matrices (SPM) test. The short test cuts testing time to a fraction of the full test. Correlations of scores on the 9-item version with the full scale were in the range of .90 to .98, indicating a minimal loss of measurement accuracy. The short SPM promises to be highly useful for research applications.

Perspective on Culture-Fair Tests

Cattell's Culture Fair Intelligence Test (CFIT) and Raven's Progressive Matrices (RPM) often are cited as examples of culture-fair tests, a concept with a long and confused history. We will attempt to clarify terms and issues here.

The first point to make is that intelligence tests are merely samples of what people know and can do. We must not reify intelligence and overvalue intelligence tests. Tests are never samples of innate intelligence or culture-free knowledge. All knowledge is based in culture and acquired over time. As Scarr (1994) notes, there is no such thing as a culture-free test.

But what about a culture-fair test, one that poses problems that are equally familiar (or unfamiliar) to all cultures? This would appear to be a more realistic possibility than a culture-free test, but even here the skeptic can raise objections. Consider the question of what a test means, which differs from culture to culture. In theory, a test of matrices would appear to be equally fair to most cultures. But in practice, issues of equity arise. Persons reared in Western cultures are trained in linear, convergent thinking. We know that the purpose of a test is to find the single, best answer and to do so quickly. We examine the 3 × 3 matrix from left to right and top to bottom, looking for the logical principles invoked in the succession of forms. Can we assume that persons reared in Nepal or New Guinea or even the remote, rural stretches of Idaho will do the same? The test may mean something different to them. Perhaps they will approach it as a measure of aesthetic progression rather than logical succession. Perhaps they will regard it as so much silliness not worthy of intense intellectual effort. To assume that a test is equally fair to all cultural groups merely because the stimuli are equally familiar (or unfamiliar) is inappropriate. We can talk about degrees of cultural fairness (or unfairness), but the notion that any test is absolutely culture-fair surely is mistaken.

MULTIPLE APTITUDE TEST BATTERIES

In a multiple aptitude test battery, the examinee is tested in several separate, homogeneous aptitude areas. Typically, the development of the subtests is dictated by the findings of factor analysis. For example, Thurstone developed one of the first multiple aptitude test batteries, the Primary Mental Abilities Test, a set of seven tests chosen on the basis of factor analysis (Thurstone, 1938).

More recently, several multiple aptitude test batteries have gained favor for educational and career counseling, vocational placement, and armed services classification (Gregory, 1994a). Each year hundreds of thousands of persons are administered one of these prominent batteries: the Differential Aptitude Test (DAT), the General Aptitude Test Battery (GATB), and the Armed Services Vocational Aptitude Battery (ASVAB). These batteries either used factor analysis directly for the delineation of useful subtests or were guided in their construction by the accumulated results of other factor-analytic research. The salient characteristics of each battery are briefly reviewed in the following sections.

The Differential Aptitude Test (DAT)

The DAT was first issued in 1947 to provide a basis for the educational and vocational guidance of students in grades 7 through 12. Subsequently, examiners have found the test useful in the vocational counseling of young adults out of school and in the selection of employees. Now in its fifth edition (1992), the test has been periodically revised and stands as one of the most popular multiple aptitude test batteries of all time (Bennett, Seashore, & Wesman, 1982, 1984). Wang (1995) provides a succinct overview of the test.

The DAT consists of eight independent tests:

1. Verbal Reasoning (VR)
2. Numerical Reasoning (NR)
3. Abstract Reasoning (AR)
4. Perceptual Speed and Accuracy (PSA)
5. Mechanical Reasoning (MR)
6. Space Relations (SR)
7. Spelling (S)
8. Language Usage (LU)

A characteristic item from each test is shown in Figure 6.4.

The authors chose the areas for the eight tests based on experimental and experiential data rather than relying on a formal factor analysis of their own.

FIGURE 6.4 Differential Aptitude Tests and Characteristic Items

VERBAL REASONING
Choose the correct pair of words to fill in the blanks.
______ is to eye as eardrum is to ______
A. vision . . . sound
B. iris . . . hear
C. retina . . . ear
D. sight . . . cochlea
E. eyelash . . . earlobe

NUMERICAL ABILITY
Choose the correct answer.
4(−5)(−3) =
A. −60  B. 27  C. −27  D. 60  E. none of these

ABSTRACT REASONING
The four figures in the row to the left make a series. Find the single choice on the right that would be next in the series. [Figural item with response options A–D]

CLERICAL SPEED AND ACCURACY
In each test item, one of the combinations is underlined. Mark the same combination on the answer sheet.
1. AB Ab AA BA Bb  (answer row: Ab Bb AA BA AB)
2. 5m 5M M5 Mm m5  (answer row: M5 m5 Mm 5m 5M)

MECHANICAL REASONING
Which lever will require more force to lift an object of the same weight? If equal, mark C. [Figural item with response options A, B, and C (equal)]

SPACE RELATIONS
Which of the figures on the right can be made by folding the pattern at the left? The pattern always displays the outside of the figure. [Figural item with response options A–D]

SPELLING
Mark whether each word is spelled right or wrong.
1. irelevant  (R / W)
2. parsimonious  (R / W)
3. excellant  (R / W)

LANGUAGE USAGE
Decide which part of the sentence contains an error and mark the corresponding letter on the answer sheet. Mark N (None) if there is no error.
In spite of public criticism, (A) / the researcher studied (B) / the affects of radiation (C) / on plant growth. (D)

FIGURE 6.4 continued

In constructing the DAT, the authors were guided by several explicit criteria:

• Each test should be an independent test: There are situations in which only part of the battery is required or desired.
• The tests should measure power: For most vocational purposes to which test results contribute, the evaluation of power—solving difficult problems with adequate time—is of primary concern.
• The test battery should yield a profile: The eight separate scores can be converted to percentile ranks and plotted on a common profile chart.
• The norms should be adequate: In the fifth edition, the norms are derived from 100,000 students for the fall standardization and 70,000 for the spring standardization.
• The test materials should be practical: With time limits of 6 to 30 minutes per test, the entire DAT can be administered in a morning or an afternoon school session.
• The tests should be easy to administer: Each test contains excellent "warm-up" examples and can be administered by persons with a minimum of special training.
• Alternate forms should be available: For purposes of retesting, the availability of alternate forms (currently forms C and D) will reduce any practice effects.

The reliability of the DAT is generally quite high, with split-half coefficients largely in the .90s and alternate-forms reliabilities ranging from .73 to .90, with a median of .83. Mechanical Reasoning is an exception, with reliabilities as low as .70 for girls. The tests show a mixed pattern of intercorrelations with each other, which is optimistically interpreted by the authors as establishing the independence of the eight tests. Actually, many of the correlations are quite high, and it seems likely that the eight tests reflect a smaller number of ability factors. Certainly, the Verbal Reasoning and Numerical Reasoning tests measure a healthy general factor, with correlations around .70 in various samples.

The manual presents extensive data demonstrating that the DAT tests, especially the VR + NR combination, are good predictors of other criteria such as school grades and scores on other aptitude tests (correlations in the .60s and .70s). For this reason, the combination of VR + NR often is considered an index of scholastic aptitude. Evidence for the differential validity of the other tests is rather slim. Bennett, Seashore, and Wesman (1974) do present results of several follow-up studies correlating vocational entry/success with DAT profiles, but their research methods are more impressionistic than quantitative; the independent observer will find it difficult to make use of their results. Schmitt (1995) notes that a major problem with the battery is the

lack of discriminant validity between the eight subtests. With the exception of the Perceptual Speed and Accuracy test, all of the subscales are highly intercorrelated (.50 to .75). If one wants only a general index of the person's academic ability, this is fine; if the scores on the subtests are to be used in some diagnostic sense, this level of intercorrelation makes statements about students' relative strengths and weaknesses highly questionable.

Even so, the revised DAT is better than previous editions. One significant improvement is the elimination of apparent sex bias on the Language Usage and Mechanical Reasoning tests—a source of criticism from earlier reviews. The DAT has been translated into several languages and is widely used in Europe for vocational guidance and research applications (e.g., Nijenhuis, Evers, & Mur, 2000; Colom, Quiroga, & Juan-Espinosa, 1999).

A computerized version of the DAT has been available for several years, although its equivalence to the traditional paper-and-pencil format cannot be taken for granted (Alkhadher, Clarke, & Anderson, 1998). We will have more to say about computerized testing in a later section of the book. For now, it will suffice to mention that the psychometric qualities of a test may shift when the mode of administration is changed. Using counterbalanced testing in which examinees completed both versions (half taking the traditional version first, half taking the computerized version first), Alkhadher et al. (1998) found that oil refinery trainees (N = 122) scored higher on one subtest of the computerized version than on the traditional version of the DAT, namely, the Numerical Ability subtest. The researchers conjectured that the computerized version reduced test fatigue, alleviated time pressure, and also provided novelty—thus boosting test performance modestly.

The General Aptitude Test Battery (GATB)

In the late 1930s, the U.S. Department of Labor developed aptitude tests to predict job performance in 100 specific occupations. In the 1940s, the department hired a panel of experts in measurement and industrial-organizational psychology to create a multiple aptitude test battery to assess the 100 occupations previously studied and many more. The outcome of this Herculean effort was the General Aptitude Test Battery (GATB), widely acknowledged as the premiere test battery for predicting job performance (Hunter, 1994).

The GATB was derived from a factor analysis of 59 tests administered to thousands of male trainees in vocational courses (United States Employment Service, 1970). The interpretive standards have been periodically revised and updated, so the GATB is a thoroughly modern instrument even though its content is little changed. One limitation is that the battery is available mainly to state employment offices, although nonprofit organizations, including high schools and certain colleges, can make special arrangements for its use.

The GATB is composed of eight paper-and-pencil tests and four apparatus measures. The entire battery can be administered in approximately two and a half hours and is appropriate for high school seniors and adults. The 12 tests yield a total of nine factor scores:

• General Learning Ability (intelligence) (G). This score is a composite of Vocabulary, Arithmetic Reasoning, and Three-Dimensional Space.
• Verbal Aptitude (V). Derived from a Vocabulary test that requires the examinee to indicate which two words in a set are either synonyms or antonyms.
• Numerical Aptitude (N). This score is a composite of both the Computation and Arithmetic Reasoning tests.
• Spatial Aptitude (S). Consists of the Three-Dimensional Space test, a measure of the ability to perceive two-dimensional representations of three-dimensional objects and to visualize movement in three dimensions.
• Form Perception (P). This score is a composite of Form Matching and Tool Matching, two tests in which the examinee must match identical drawings.
• Clerical Perception (Q). A proofreading test called Name Comparison, in which the examinee must match names under pressure of time.
• Motor Coordination (K). Measures the ability to quickly make specified pencil marks in the Mark Making test.
• Finger Dexterity (F). A composite of the Assemble and Disassemble tests, two measures of dexterity with rivets and washers.
• Manual Dexterity (M). A composite of Place and Turn, two tests requiring the examinee to transfer and reverse pegs in a board.

The nine factor scores on the GATB are expressed as standard scores with a mean of 100 and an SD of 20. These standard scores are anchored to the original normative sample of 4,000 workers obtained in the 1940s. Alternate-forms reliability coefficients for factor scores range from the .80s to the .90s. The GATB manual summarizes several studies of the validity of the test, primarily in terms of its correlation with relevant criterion measures. Hunter (1994) notes that GATB scores predict training success for all levels of job complexity. The average validity coefficient is a phenomenal .62.

The absolute scores are of less interest than their comparison to updated Occupational Aptitude Patterns (OAPs) for dozens of occupations. Based on test results for huge samples of applicants and employees in different occupations, counselors and employers now have access to a wealth of information about score patterns needed for success in a variety of jobs. Thus, one way of using the GATB is to compare an examinee's scores with OAPs believed necessary for proficiency in various occupations.

Hunter (1994) recommends an alternative strategy based on composite aptitudes (Figure 6.5).

FIGURE 6.5 Specific and General Factors on the GATB
Cognitive: G General Learning Ability (intelligence), V Verbal Aptitude, N Numerical Aptitude
Perceptual: S Spatial Aptitude, P Form Perception, Q Clerical Perception
Psychomotor: K Motor Coordination, F Finger Dexterity, M Manual Dexterity

The nine specific factor scores combine nicely into three general factors: Cognitive, Perceptual, and Psychomotor. Hunter notes that different jobs require various contributions of the Cognitive, Perceptual, and Psychomotor aptitudes. For example, an assembly line worker in an automotive plant might need high scores on the Psychomotor and Perceptual composites, whereas the Cognitive score would be less important for this occupation. Hunter's research demonstrates that general factors dominate over specific factors in the prediction of job performance. Davison, Gasser, and Ding (1996) discuss additional approaches to GATB profile analysis and interpretation.
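A minimal sketch of this composite strategy, using the grouping shown in Figure 6.5. Equal weighting of the specific-factor standard scores is our simplifying assumption for illustration, not the published GATB scoring procedure:

```python
# Grouping follows Figure 6.5; equal weighting is an illustrative assumption,
# not the official GATB scoring rules.
COMPOSITES = {
    "Cognitive": ("G", "V", "N"),
    "Perceptual": ("S", "P", "Q"),
    "Psychomotor": ("K", "F", "M"),
}

def general_factors(factor_scores: dict) -> dict:
    """Average the nine specific-factor standard scores (mean 100, SD 20)
    into the three general composites."""
    return {
        name: round(sum(factor_scores[f] for f in group) / len(group), 1)
        for name, group in COMPOSITES.items()
    }

# Hypothetical examinee profile
profile = {"G": 110, "V": 105, "N": 120, "S": 95, "P": 90,
           "Q": 100, "K": 85, "F": 80, "M": 90}
print(general_factors(profile))
# {'Cognitive': 111.7, 'Perceptual': 95.0, 'Psychomotor': 85.0}
```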

Van de Vijver and Harsveld (1994) investigated the equivalence of their computerized version of the GATB with the traditional paper-and-pencil version. Of course, only the cognitive and perceptual subtests were compared—tests of motor skills cannot be computerized. They found that the two versions were not equivalent. In particular, the computerized subtests produced faster and more inaccurate responses than the conventional subtests. Their research demonstrates once again that the equivalence of traditional and computerized versions of a test should not be assumed. This is an empirical question answerable only with careful research. Nijenhuis and van der Flier (1997) discuss a Dutch version of the GATB and its application in the study of cognitive differences between immigrants and majority group members in the Netherlands.

The Armed Services Vocational Aptitude Battery (ASVAB)

The ASVAB is probably the most widely used aptitude test in existence. This instrument is used by the Armed Services to screen potential recruits and to assign personnel to different jobs and training programs. The ASVAB is also available in a computerized version that is rapidly supplanting the original paper-and-pencil test (Segall & Moreno, 1999). The computerized ASVAB is discussed in more detail at the end of this section. More than 2 million examinees take the ASVAB each year. The current version consists of nine subtests, four of which produce the Armed Forces Qualification Test (AFQT), the common qualifying exam for all services (Table 6.1). Alternate-forms reliability coefficients for ASVAB scores are in the mid-.80s to mid-.90s, and test–retest coefficients range from the mid-.70s to the mid-.80s (Larson, 1994). The one exception is Paragraph Comprehension, with a reliability of only .50. The test is well normed on a representative sample of 12,000 persons between the ages of 16 and 23 years. The ASVAB manual reports a median validity coefficient of .60 with measures of training performance.

Decisions about ASVAB examinees are typically based on composite scores, not subtest scores. For example, an Electronics Composite is derived by combining Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science. Persons scoring well on this composite might be assigned to electronics-related positions. Since the composite scores are empirically derived, new ones can be developed for placement decisions at any time. Composite scores are continually updated and revised.

At one point, the Armed Services relied heavily on the seven composites in the following list (Murphy, 1984). The Coding Speed subtest, listed here, is no longer used. The first three constitute academic composites, whereas the remaining four are occupational composites. The reader will notice that individual subtests may appear in more than one composite:

1. Academic Ability: Word Knowledge, Paragraph Comprehension, and Arithmetic Reasoning
2. Verbal: Word Knowledge, Paragraph Comprehension, and General Science
3. Math: Mathematics Knowledge and Arithmetic Reasoning
4. Mechanical and Crafts: Arithmetic Reasoning, Mechanical Comprehension, Auto and Shop Information, and Electronics Information
5. Business and Clerical: Word Knowledge, Paragraph Comprehension, Mathematics Knowledge, and Coding Speed
6. Electronics and Electrical: Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science
7. Health, Social, and Technology: Word Knowledge, Paragraph Comprehension, Arithmetic Reasoning, and Mechanical Comprehension

TABLE 6.1 The Armed Services Vocational Aptitude Battery (ASVAB) Subtests

Arithmetic Reasoning* 16-item test of arithmetic word problems based on simple calculation

Mathematics Knowledge* 25-item test of algebra, geometry, fractions, decimals, and exponents

Word Knowledge* 35-item test of vocabulary knowledge and synonyms

Paragraph Comprehension* 15-item test of reading comprehension in short paragraphs

General Science 25-item test of general knowledge in physical and biological science

Mechanical Comprehension 25-item test of mechanical and physical principles

Electronics Information 20-item test of electronics, radio, and electrical principles

Assembling Objects 16-item test of mechanical and assembly concepts

Auto and Shop 25-item test of basic knowledge of autos, shop practices, and tool usage

*Armed Forces Qualifying Test (AFQT).

The problem with forming composites in this manner is that they are so highly correlated with one another as to be essentially redundant. In fact, the average intercorrelation among these seven composite scores is .86 (Murphy, 1984)! Clearly, composites do not always provide differential information about specific aptitudes. Perhaps that is why recent editions of the ASVAB have steered clear of multiple, complex composites. Instead, the emphasis is on simpler composites that are composed of highly related constructs. For example, a Verbal Ability composite is derived from Word Knowledge and Paragraph Comprehension, two highly interrelated subtests. In like manner, a Math Ability composite is obtained from the combination of Arithmetic Reasoning and Mathematics Knowledge.

Some researchers have concluded that the ASVAB does not function as a multiple aptitude test battery but achieves success in predicting diverse vocational assignments because the composites invariably tap a general factor of intelligence. For example, Dunai and Porter (2001) report favorably on the ASVAB as a predictor of entry-level success of radiography students in Air Force medical training. The ASVAB may be a good test of general intelligence, but it falls short as a multiple aptitude test battery. Another concern is that the test may possess different psychometric structures for men and women. Specifically, the Electronics Information subtest is a good measure of g (the general factor of intelligence) for men but not women (Ree & Carretta, 1995). The likely explanation for this is that men are about nine times more likely to enroll in high school classes in electronics and auto shop; men, therefore, have the opportunity for their general ability to shape what they learn about electronics information, whereas women do not. Scores on this subtest will, therefore, function as a measure of achievement (what has already been learned) but not as an index of aptitude (forecasting future results).

Research on a computerized adaptive testing (CAT) version of the ASVAB has been under way since the 1980s. Computerized adaptive testing is discussed in Topic 12B, Computerized Assessment and the Future of Testing. We provide a brief overview here. In CAT, the examinee takes the test while sitting at a computer terminal. The difficulty level of the items presented on the screen is continually readjusted as a function of the examinee's ongoing performance. In general, an examinee who answers a subtest item correctly will receive a harder item, whereas an examinee who fails that item will receive an easier item. The computer uses item response theory as a basis for selecting items. Each examinee receives a unique set of test items tailored to his or her ability level.

In 1990, the CAT-ASVAB began to replace the paper-and-pencil ASVAB. Currently, more than two-thirds of all military applicants are tested with the computerized version. Larson (1994) lists the reasons for adopting the CAT-ASVAB as follows:

1. Shorten overall testing time (adaptive tests require roughly one-half the items of standard tests).
2. Increase test security by eliminating the possibility that test booklets could be stolen.
3. Increase test precision at the upper and lower ability extremes.
4. Provide a means for immediate feedback on test scores, since the computers used for testing can immediately score the tests and output the results.
5. Provide a means for flexible test start times (unlike group-administered paper-and-pencil tests, for which everyone must start and stop at the same time, computer-based testing can be tailored to the examinees' personal schedules) (Larson, 1994).

Reliability and validity studies of the CAT-ASVAB provide strong support for its equivalence to the original test. In general, the computerized version of the instrument measures the same constructs as its paper-and-pencil counterpart—and does so in less time and with greater precision (Moreno & Segall, 1997). With the success of this project, the CAT-ASVAB and other tests likely will be expanded to measure new aspects of performance such as response latencies and to display unique item types such as visuospatial tests of objects in motion (Larson, 1994). The CAT-ASVAB has the potential to change the future of testing.
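The adaptive logic described above can be sketched with a one-parameter (Rasch) item response model. The item bank, ability-update rule, and stopping rule below are deliberate simplifications for illustration, not the CAT-ASVAB's operational algorithm:

```python
import math
import random

def p_correct(theta: float, b: float) -> float:
    """Rasch (one-parameter logistic) model: P(correct | ability, difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_cat(true_theta: float, bank: list, n_items: int = 10) -> float:
    """Administer n_items adaptively and return the final ability estimate."""
    theta, step = 0.0, 1.0                 # provisional estimate, update size
    remaining = list(bank)
    for _ in range(n_items):
        # For a Rasch item, information peaks when difficulty ~= ability,
        # so select the remaining item closest to the current estimate.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        answered_correctly = random.random() < p_correct(true_theta, b)
        # Step up after a pass, down after a fail; shrink the step over time.
        theta += step if answered_correctly else -step
        step = max(step * 0.5, 0.25)
    return theta

random.seed(1)
item_bank = [i / 4 for i in range(-12, 13)]   # difficulties from -3.0 to +3.0
print(round(simulate_cat(true_theta=1.2, bank=item_bank), 2))
```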

PREDICTING COLLEGE PERFORMANCE

As most every college student knows, a major use of aptitude tests is the prediction of academic performance. In most cases, applicants to college must contend with the Scholastic Assessment Test (SAT) or the American College Test (ACT) assessment program. Institutions may set minimum standards on the SAT or ACT tests for admission, based on the knowledge that low scores foretell college failure. In this section we will explore the technical adequacy and predictive validity of the major college aptitude tests.

The Scholastic Assessment Test (SAT)

Formerly known as the Scholastic Aptitude Test, the Scholastic Assessment Test, or SAT, is the oldest of the college admissions tests, dating back to 1926. The SAT is published by the College Board (formerly the College Entrance Examination Board), a group formed in 1899 to provide a national clearinghouse for admissions testing. As noted by historian Fuess (1950), the purpose of a nationally based admissions test was "to introduce law and order into an educational anarchy which towards the close of the nineteenth century had become exasperating, indeed almost intolerable, to schoolmasters." Over the years, the test has been extensively revised, continuously updated, and repeatedly renormed. In the early 1990s, the SAT was renamed the Scholastic Assessment Test to emphasize changes in content and format. The new SAT assesses mastery of high school subject matter to a greater extent than its predecessor but continues to tap reasoning skills. The SAT represents the state of the art for aptitude testing.

The new SAT, released in 2005, consists of the SAT Reasoning Test and the SAT Subject Tests. The SAT Reasoning Test is used for college admission decisions, whereas the optional SAT Subject Tests typically are needed for advanced college placement in fields such as Biology, Chemistry, History, Foreign Languages, and Mathematics. We restrict our discussion here to the SAT Reasoning Test. For ease of discussion, we refer to it simply as the "SAT."

The SAT consists of three sections, each containing three or four subtests (Table 6.2). The Critical Reading section involves reading individual paragraphs and then answering multiple-choice questions about the passages. The questions embody three approaches:

Vocabulary in Context—discerning the meaning of words from their context in the passage
Literal Comprehension—understanding significant information directly available in the passage
Extended Reasoning—following an argument or making inferences from the passage

Some questions in the Critical Reading section also engage a complex form of fill in the blanks. However, instead of testing for mere factual knowledge, the questions evaluate verbal comprehension. Here is a straightforward example:

Hoping to ______ the dispute, the family therapist proposed a concession that he felt would be ______ to both mother and daughter.

A. end . . . divisive
B. overcome . . . unappealing
C. protract . . . satisfactory
D. resolve . . . acceptable
E. enforce . . . useful

The correct answer is D. Of course, the SAT incorporates more difficult items of this genre.

TABLE 6.2 Sections and Subtests of the SAT Reasoning Test

Critical Reading: Extended Reasoning; Literal Comprehension; Vocabulary in Context
Math: Numbers and Operations; Algebra and Functions; Geometry and Measurement; Data Analysis, Statistics, and Probability
Writing: Essay; Improving Sentences; Identifying Sentence Errors; Improving Paragraphs


The second part of the SAT is the Math section, consisting of three subtests. Collectively, these subtests assess the basic math skills in algebra, geometry, statistics, and data analysis needed for successful navigation of college. Most of the questions are in multiple-choice format, for example:

A special lottery was announced to select the student who will live in the only luxury apartment in student housing. In all, 50 juniors, 125 sophomores, and 175 freshmen applied. However, juniors were allowed to purchase 4 tickets each. What is the probability that the room will be awarded to a junior?

A. 1/5
B. 1/2
C. 2/5
D. 1/7
E. 2/7

The correct answer is C.
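The answer follows from counting tickets rather than applicants: the 50 juniors hold 4 tickets apiece. A quick check (this snippet is illustrative and not part of the SAT materials):

```python
from fractions import Fraction

# Each of the 50 juniors may purchase 4 tickets; the 125 sophomores
# and 175 freshmen hold 1 ticket apiece.
junior_tickets = 50 * 4                       # 200
total_tickets = junior_tickets + 125 + 175    # 500

print(Fraction(junior_tickets, total_tickets))  # 2/5, i.e., choice C
```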

In addition to multiple-choice questions, the Math section includes several items that require the student to generate a single correct answer and then enter it on the response sheet. For example:

What value of x satisfies both equations below?

x² – 4 = 0

|4x + 6| = 2

The correct answer is –2. Strategies for finding a solution that might work with a multiple-choice question—trial and error, or process of elimination—are not likely to help with this style of question. Here the examinee must generate the correct answer by dint of careful analysis.
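The analysis is short: the quadratic yields x = 2 or x = −2, while the absolute-value equation yields x = −1 or x = −2, so only x = −2 satisfies both. A brute-force check over the candidate roots (again, an illustration rather than official test material):

```python
# Candidate roots: +/-2 from the quadratic, -1 and -2 from |4x + 6| = 2.
candidates = [2.0, -2.0, -1.0]
solutions = [x for x in candidates if x**2 - 4 == 0 and abs(4*x + 6) == 2]
print(solutions)  # [-2.0]
```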

The Writing portion of the SAT now consists of a 25-minute Essay section and three multiple-choice subtests that evaluate the ability of the examinee to improve sentences, identify sentence errors, and improve paragraphs. In the Essay test, the examinee reads a short excerpt and then writes a short paper that takes a point of view. Here is an example of an excerpt and assignment:

A sense of happiness and fulfillment, not personal gain, is the best motivation and reward for one's achievements. Expecting a reward of wealth or recognition for achieving a goal can lead to disappointment and frustration. If we want to be happy in what we do in life, we should not seek achievement for the sake of winning wealth and fame. The personal satisfaction of a job well done is its own reward.

Assignment: Are people motivated to achieve by personal satisfaction rather than by money or fame? Plan and write an essay in which you develop your point of view on this issue. Support your position with reasoning and examples taken from your reading, studies, experience, or observations. (College Board, 2005)

The essay is evaluated by two trained readers on a 1 to 6 scale, resulting in a total score of 2 to 12 for the Essay test. Students also receive a separate score on a scale from 20 to 80 for the multiple-choice portion of the Writing section. These two scores are combined for the overall section score for Writing. SAT scores for each of the three sections—Critical Reading, Math, and Writing—are reported on the familiar 200- to 800-point scale, with an approximate mean of 500 and standard deviation of 100.

Great care is taken in the construction of new forms of the SAT because unfailing reliability and a high degree of parallelism are essential to the mission of this testing program. Historically, the internal consistency reliability of all sections is repeatedly in the range of .91 to .93; with only a few exceptions, test–retest correlations vary between .87 and .89. The standard error of measurement is 30 to 35 points.
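These figures hang together: the standard error of measurement (SEM) follows directly from the reliability coefficient and the 100-point standard deviation via SEM = SD·√(1 − r). A minimal check (the score of 600 below is an arbitrary example, not from the text):

```python
import math

sd = 100        # standard deviation of each SAT section
r_xx = 0.91     # internal consistency reliability (low end of the range)

sem = sd * math.sqrt(1 - r_xx)
print(f"SEM = {sem:.0f} points")      # 30; a test-retest r of .87 gives ~36

# A 68% confidence band around an observed score of 600:
print(f"band: {600 - sem:.0f} to {600 + sem:.0f}")   # roughly 570 to 630
```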

Frey and Detterman (2004) conducted a sophisticated factor analytic study of the relationship between the SAT and g, or general intelligence. Results for 917 youth who took both the SAT and the ASVAB indicated a correlation of .82 between g (as extracted from ASVAB results) and SAT scores. They concluded that the SAT is an excellent measure of general cognitive ability.

The primary evidence for SAT validity is criterion-related: in this case, the ability to predict first-year college grades. Donlon (1984, chap. VIII) reports a wealth of information on this point for earlier editions; we can only summarize trends here. In 685 studies, the combined SAT Verbal and Math scores correlated .42, on average, with college first-year grade point average. Interestingly, high school record (e.g., rank or grade point average) fares better than the SAT in predicting college grades (r = .48). But the combination of SAT and high school record proves even more predictive; these variables correlated .55, on average, with college first-year grade point average. Of course, these findings reflect a substantial restriction of range: low SAT-scoring high school students tend not to attend college. Donlon (1984) estimated that the real correlation without restriction of range (SAT plus high school record) would be in the neighborhood of .65. According to the College Board website, the combination of SAT and high school GPA continues to provide a robust correlation (r = .62) with freshman grades. Based on a sample of 151,316 students attending 110 colleges and universities across the United States, these results leave no room for doubt as to the general predictive power of SAT scores (www.collegeboard.com). However, the results also show that for students whose best language is not English (e.g., children of recent immigrants), the crucial reading and writing portions of the SAT underpredict freshman grades.


The American College Test (ACT)

The American College Test (ACT) assessment program is a program of testing and reporting designed for college-bound students. In addition to traditional test scores, the ACT assessment includes a brief 90-item interest inventory (based on Holland's typology) and a student profile section (in which the student may list subjects studied, notable accomplishments, work experience, and community service). We will not discuss these ancillary measures here, except to note that they are useful in generating the Student Profile Report, which is sent to the examinee and the colleges listed on the registration folder.

Initiated in 1959, the ACT is based on the philosophy that direct tests of the skills needed in college courses provide the most efficient basis for predicting college performance. In terms of the number of students who take it, the ACT occupies second place behind the SAT as a college admissions test. The four ACT tests require knowledge of a subject area but emphasize the use of that knowledge:

• English (75 questions, 45 minutes). The examinee is presented with several prose passages excerpted from published writings. Certain portions of the text are underlined and numbered, and possible revisions for the underlined sections are presented; in addition, "no change" is one choice. The examinee must choose the best option.

• Mathematics (60 questions, 60 minutes). Here the examinee is asked to solve the kinds of mathematics problems likely to be encountered in basic college mathematics courses. The test emphasizes concepts rather than formulas and uses a multiple-choice format.

• Reading (40 questions, 35 minutes). This subtest is designed to assess the examinee's level of reading comprehension; subscores are reported for social studies/sciences and arts/literature reading skills.

• Science Reasoning (40 questions, 35 minutes). This test assesses the ability to read and understand material in the natural sciences. The questions are drawn from data representations, research summaries, and conflicting viewpoints.

In addition to the area scores listed previously, ACT results are also reported as an overall Composite score, which is the average of the four tests. ACT scores are reported on a 36-point standard score scale. In 2012, the average ACT Composite score of high school graduates was 21.1, with a standard deviation of about 5 points.
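Given these norms, any Composite score can be located relative to the national group. A sketch (the score of 26 is an arbitrary example, and the assumption of a roughly normal score distribution is ours):

```python
import math

mean, sd = 21.1, 5.0   # 2012 ACT Composite norms cited above
score = 26             # arbitrary example score

z = (score - mean) / sd
pct_below = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal-curve area below z
print(f"z = {z:.2f}; about {pct_below:.0%} of graduates score lower")
```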

Critics of the ACT program have pointed to the heavy emphasis on reading comprehension that saturates all four tests. The average intercorrelation of the tests is typically around .60. These data suggest that a general achievement/ability factor pervades all four tests; results for any one test should not be overinterpreted. Fortunately, college admission officers probably place the greatest emphasis on the Composite score, the average of the four separate tests. The ACT appears to measure much the same thing as the SAT; the correlation between these two tests approaches .90.


It is not surprising, then, that the predictive validity of the ACT Composite score rivals the SAT combined score, with correlations in the vicinity of .40 to .50 with college first-year grade point average. The predictive validity coefficients are virtually identical for advantaged and disadvantaged students, indicating that the ACT tests are not biased.

Kifer (1985) does not question the technical adequacy of the ACT and similar testing programs but does protest the enormous symbolic power these tests have accrued. The heavy emphasis on test scores for college admissions is not a technical issue, but a social, moral, and political concern:

Selective admissions means simply that an institution cannot or will not admit each person who completes an application. Choices of who will or will not be admitted should be, first of all, a matter of what the institution believes is desirable and may or may not include the use of prediction equations. It is just as defensible to select on talent broadly construed as it is to use test scores however high. There are talented students in many areas—leaders, organizers, doers, musicians, athletes, science award winners, opera buffs—who may have moderate or low ACT scores but whose presence on a campus would change it.

The reader may wish to review Topic 6B, Test Bias and Other Controversies, for further discussion of this point.

POSTGRADUATE SELECTION TESTS

Graduate and professional programs also rely heavily on aptitude tests for admission decisions. Of course, many other factors are considered when selecting students for advanced training, but there is no denying the centrality of aptitude test results in the selection decision. For example, Figure 6.6 depicts a fairly typical quantitative weighting system used in evaluating applicants for graduate training in psychology. The reader will notice that an overall score on the Graduate Record Exam (GRE) receives the single highest weighting in the selection process. We review the GRE in the following sections, as well as admission tests used by medical schools and law schools.

FIGURE 6.6 Representative Weighting Scheme Used by Graduate Program Admission Committees in Psychology

GRE Scores: 0, 6, 12, 18, 24, or 30 points
(GRE-V + GRE-Q total of 1,000 / 1,100 / 1,200 / 1,300 / 1,400)

Undergraduate GPA: 0, 5, 10, 15, 20, or 25 points
(GPA of 3.0 / 3.2 / 3.4 / 3.6 / 3.8)

Psychology GPA: 0, 1, 2, 3, 4, or 5 points
(GPA of 3.0 / 3.2 / 3.4 / 3.6 / 3.9)

Background in Statistics/Experimental: 0 to 5 points
Background in Biology/Chemistry: 0 to 5 points
Background in Math/Computer Science: 0 to 5 points
Research Experience: 0 to 5 points
Positive Interpersonal Skills: 0, 2, 4, 6, 8, or 10 points
Ethnic/Linguistic/Cultural Diversity: 0, 2, 4, 6, 8, or 10 points

Maximum Total: 100
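In code, a rubric of this kind reduces to a lookup against each row's anchors. The sketch below implements one plausible reading of Figure 6.6 (the treatment of values that fall below or between the printed anchors is our assumption; committees may interpolate differently), applied to a hypothetical applicant:

```python
def points(value, anchors, pts):
    """Award the points for the highest anchor the value reaches.

    anchors and pts are parallel, ascending lists taken from Figure 6.6;
    values below the first anchor earn 0 (an assumption -- the figure
    does not spell out how sub-threshold values are handled).
    """
    earned = 0
    for a, p in zip(anchors, pts):
        if value >= a:
            earned = p
    return earned

# A hypothetical applicant:
total = (
    points(1250, [1000, 1100, 1200, 1300, 1400], [6, 12, 18, 24, 30])  # GRE
    + points(3.5, [3.0, 3.2, 3.4, 3.6, 3.8], [5, 10, 15, 20, 25])      # UGPA
    + points(3.7, [3.0, 3.2, 3.4, 3.6, 3.9], [1, 2, 3, 4, 5])          # Psych GPA
    + 4 + 3 + 2   # committee ratings of the three background areas (0-5 each)
    + 4           # research experience (0-5)
    + 8           # positive interpersonal skills (0-10)
    + 6           # diversity (0-10)
)
print(total)      # 18 + 15 + 4 + 9 + 4 + 8 + 6 = 64 of a possible 100
```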


Graduate Record Exam (GRE)

The GRE is a multiple-choice and essay test widely used by graduate programs in many fields as one component in the selection of candidates for advanced training. The GRE offers subject examinations in many fields (e.g., Biology, Computer Science, History, Mathematics, Political Science, Psychology), but the heart of the test is the general test designed to measure verbal, quantitative, and analytical writing aptitudes. The verbal section (GRE-V) includes verbal items such as analogies, sentence completion, antonyms, and reading comprehension. The quantitative section (GRE-Q) consists of problems in algebra, geometry, reasoning, and the interpretation of data, graphs, and diagrams. The analytical writing section (GRE-AW) was added in October 2002 as a measure of higher-level critical thinking and analytical writing skills. It consists of two writing tasks: a 30-minute essay in which the applicant analyzes an issue, and a 30-minute essay in which the applicant analyzes an argument. Here is an example of an issue question:

As people rely more and more on technology to solve problems, the ability of humans to think for themselves will surely deteriorate.

Discuss the extent to which you agree or disagree with the statement and explain your reasoning for the position you take. In developing and supporting your position, you should consider ways in which the statement might or might not hold true and explain how these considerations shape your position. (www.ets.org/gre)

The argument questions entail reading a short paragraph that advances an argument and writing a critique of that argument.

Beginning in 2012, the first two scores (GRE-V and GRE-Q) were reported as standard scores with a mean of about 150 and a range of 130 to 170. This new scaling metric represents a substantial change from the familiar GRE scale employed since the 1950s. Prior to 2012, the first two scores (GRE-V and GRE-Q) were reported as standard scores with a mean of about 500 and standard deviation of 100 (range of 200 to 800). Actually, the mean scores shifted from year to year because all test results were anchored to a standard reference group of 2,095 college seniors tested in 1952 on the verbal and quantitative portions of the test. Historically, graduate programs have paid more attention to the first two parts of the test (GRE-V and GRE-Q). Recently, programs have acknowledged the importance of writing skills in their applicants, which explains the addition of the analytical writing section (GRE-AW).

Scoring of the analytical writing section is based on 6-point holistic ratings provided independently by two trained raters. If the two scores differ by more than one point on the scale, the discrepancy is adjudicated by a third GRE-AW reader. According to the GRE Board (www.gre.org), the GRE-AW test reveals smaller ethnic group differences than found in the multiple-choice sections. For example, the differences between African American and Caucasian examinees and between Hispanic and Caucasian examinees are smaller on the GRE-AW than on the GRE-V or GRE-Q. This suggests that the new test does not unduly penalize ethnic groups traditionally underrepresented in graduate programs.
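The two-rater-plus-adjudication protocol is easy to express as a rule. In this sketch the two ratings are averaged, and a third reading simply replaces them when they diverge; how ETS actually combines the adjudicated ratings is not specified in the text, so that detail is an assumption:

```python
def essay_score(rating1, rating2, third_reader):
    """Combine two independent 6-point holistic essay ratings.

    A third reader adjudicates when the first two ratings differ by
    more than one point. Letting the third rating stand alone is an
    assumption; the official ETS combining rule may differ.
    """
    if abs(rating1 - rating2) > 1:
        return third_reader()           # adjudicated score
    return (rating1 + rating2) / 2      # ratings agree closely: average

print(essay_score(4, 5, lambda: 4))     # 4.5 -- no adjudication needed
print(essay_score(3, 6, lambda: 4))     # 4   -- discrepancy adjudicated
```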

The reliability of the GRE is strong, with internal consistency reliability coefficients typically around .90 for the three components. The validity of the GRE commonly has been examined in relation to the ability of the test to predict performance in graduate school. Performance has been operationalized mainly as grade point average, although faculty ratings of student aptitude also have been used. For example, based on a meta-analytic review of 22 studies with a total of 5,186 students, Morrison and Morrison (1995) concluded that GRE-V correlated .28 and GRE-Q correlated .22 with graduate grade point average. Thus, on average, GRE scores accounted for only 6.3 percent of the variance in graduate-level academic performance. In a recent study of 170 graduate students in psychology at Yale University, Sternberg and Williams (1997) also found minimal correlations between GRE scores and graduate grades.


When GRE scores were correlated with faculty ratings on five variables (analytical, creative, practical, research, and teaching abilities), the correlations were even lower, for the most part hovering right around zero. The single exception was the GRE analytical thinking score, which correlated modestly with almost all of the faculty ratings. However, this correlation was observed only for men (on the order of r = .3), whereas for women it was almost exactly zero in every case! Based on these and similar studies, the consensus would appear to be that excessive reliance on the GRE for graduate school selection may overlook a talented pool of promising graduate students.

However, other researchers are more supportive in their evaluation of the GRE, noting that the correlation of GRE scores and graduate grades is not a good index of validity because of the restriction of range problem (Kuncel, Campbell, & Ones, 1998). Specifically, applicants with low GRE scores are unlikely to be accepted for graduate training in the first place and, thus, relatively little information is available with respect to whether low scores predict poor academic performance. Put simply, the correlation of GRE scores with graduate academic performance is based mainly on persons with middle to high levels of GRE scores, that is, GRE-V + GRE-Q totals of 1,000 and up. As such, the correlation will be attenuated precisely because those with low GREs are not included in the sample. Another problem with validating the GRE against grades in graduate school is the unreliability of the criterion (grades). Based on the expectation that graduate students will perform at high levels, some professors may give blanket A's such that grades do not reflect real differences in student aptitudes. This would lower the correlation between the predictor (GRE scores) and the criterion (graduate grades). When these factors are accounted for, many researchers find reason to believe the GRE is still a valid tool for graduate school selection (Powers, 2004).
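The statistical remedy for the first problem is a range-restriction correction, such as Thorndike's Case II formula, which estimates what the validity coefficient would be if the full applicant pool had been admitted. A brief sketch (the observed r of .28 comes from the Morrison and Morrison figure above; the SD ratio of 1.5 is purely hypothetical):

```python
import math

def correct_range_restriction(r, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    r        -- validity coefficient observed in the restricted sample
    sd_ratio -- predictor SD in the applicant pool divided by the
                predictor SD among admitted students (> 1)
    """
    k = sd_ratio
    return (r * k) / math.sqrt(1 + r**2 * (k**2 - 1))

# Illustrative values only:
print(round(correct_range_restriction(0.28, 1.5), 2))  # ~0.40
```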

In a comprehensive meta-analysis of 1,753 independent groups of students, Kuncel, Hezlett, and Ones (2001) confirmed the validity of the GRE tests (Verbal, Quantitative, and Analytical) for the prediction of graduate student performance. The total sample size for their analysis was huge, including 82,659 students. The breadth of their investigation allowed them to code studies for several different forms of student accomplishment. GRE general test scores were significantly associated with the following student outcomes: first-year GPA, overall GPA, comprehensive exam scores, faculty ratings, and publication citation counts. The researchers also discovered that the GRE Psychology subject test outperformed the general test as a predictive measure of student success.

Medical College Admission Test (MCAT)

The MCAT is required of applicants to almost all medical schools in the United States. The test is designed to assess achievement of the basic skills and concepts that are prerequisites for successful completion of medical school. There are three multiple-choice sections: Verbal Reasoning (40 questions), Physical Sciences (52 questions), and Biological Sciences (52 questions). The Verbal Reasoning section is designed to evaluate the ability to understand and apply information and arguments presented in written form. Specifically, the test consists of several passages of about 500 to 600 words each, taken from the humanities, social sciences, and natural sciences. Each passage is followed by several questions based on information included in the passage. The Physical Sciences section is designed to evaluate reasoning in general chemistry and physics, and the Biological Sciences section is designed to evaluate reasoning in biology and organic chemistry. These two science sections contain 10 to 11 problem sets described in about 250 words each, with several questions following.

Following the three required parts of the MCAT, an optional trial section of 32 questions is administered. This portion is not scored. The purpose of the trial section is to pretest questions for future exams. Some trial questions are designed for a new section of the MCAT, Psychological, Social, and Biological Foundations of Behavior, scheduled to commence in 2015. This new section will test knowledge of important concepts in introductory psychology, sociology, and biology related to mental processes and behavior. The addition of this section acknowledges that effective doctors need to understand the whole person, including social and cultural determinants of health and health-related behaviors.


Each of the MCAT scores is reported on a scale from 1 to 15 (means of about 8.0 and standard deviations of about 2.5). The reliability of the test is lower than that of other aptitude tests used for selection, with internal consistency and split-half coefficients mainly in the low .80s (Gregory, 1994a). MCAT scores are mildly predictive of success in medical school, but once again the restriction of range conundrum (previously discussed in relation to the GRE) is at play. In particular, examinees with low MCAT scores, who would presumably confirm the validity of the test by performing poorly in medical school, are rarely admitted, which reduces the apparent validity of the test.

Julian (2005) confirmed the validity of the MCAT for predicting medical school performance by following 4,076 students who entered 14 medical schools in 1992 and 1993. Outcome variables included GPA and national medical licensing exam scores. When corrected for restriction of range, the predictive validity coefficients for MCAT scores were impressive, on the order of .6 for medical school grades and as high as .7 for licensing exam scores. In fact, the MCAT scores were so strongly predictive of licensing exam scores that adding undergraduate GPAs into the equation did not appreciably boost the correlation. Julian (2005) concludes that MCAT scores essentially replace the need for undergraduate GPAs in medical school student selection because of their remarkable capacity to predict medical licensing exam scores.

Law School Admission Test (LSAT)

The LSAT is more than 60 years old. The test arose in the 1940s as a group effort from deans of leading law schools, who used first-year grades in the early validation of the instrument (LaPiana, 1998). Practicality was a major impetus for test development, as law schools were flooded with worthy applicants. Also, there was an idealistic desire to ensure that admission to law school was based on aptitude and potential, not on privilege or connection. A leading figure in LSAT development has noted:

What makes us Americans is our adherence to the system that governs our nation. If that's true, then being a lawyer is one of the most important jobs in American society because it is the lawyer's job to make sure the law works and serves people. And if that is true, then the American legal profession is much too important to be left in the hands of a self-perpetuating elite. It has to be open to all Americans with the talent and ability to do legal work, no matter how their last names are spelled or where they or their ancestors were born or the color of their skin. (LaPiana, 1998, p. 12)

About 150,000 individuals take the LSAT each year. Of course, many other variables come into play in law school admissions, but test results probably are the single most important factor.

The LSAT is a half-day standardized test required of applicants to virtually every law school in the United States. The test is designed to measure skills considered essential for success in law school, including the reading and understanding of complex material, the organization and management of information, and the ability to reason critically and draw correct inferences. The LSAT consists of multiple-choice questions in four areas: reading comprehension, analytical reasoning, and two logical reasoning sections. An additional section is used to pretest new test items and to preequate new test forms, but this section does not contribute to the LSAT score. The score scale for the LSAT extends from a low of 120 to a high of 180. In addition to the objective portions, a 35-minute writing sample is administered at the end of the test. The section is not scored, but copies of the writing sample are sent to all law schools to which the examinee applies.

The LSAT has acceptable reliability (internal consistency coefficients in the .90s) and is regarded as a moderately valid predictor of law school grades. Yet, in one fascinating study, LSAT scores correlated more strongly with state bar exam results than with law school grades (Melton, 1985). This speaks well for the validity of the test, insofar as it links LSAT scores with an important, real-world criterion.

In recent years, those responsible for law school admissions have shown interest in selection methods that go beyond the LSAT. One example is a promising project from the University of California, Berkeley, which ambitiously seeks to assess 26 traits identified as crucial to effective performance of lawyers (Chamberlin, 2009).


Using focus groups and individual interviews, psychologist Sheldon Zedeck and lawyer Marjorie Shultz distilled these 26 traits, which include varied capacities such as practical judgment, researching the law, writing, integrity/honesty, negotiation skills, developing relationships, stress management, fact finding, diligence, listening, and community involvement/service. Next they developed realistic scenarios designed to evaluate one or more of these qualities. A sample question might ask the applicant to take the role of a team leader in a law firm. A verbal fight breaks out between two of the team members over the best way to proceed with the project. What should the team leader do? A number of options are listed, and the applicant is asked to rank them from best to worst. The format of the questions is varied. For other questions, the applicant might be asked to provide a short written response. Initial research with this yet-unnamed instrument indicates that it predicts success in the practice of law substantially better than the LSAT.

EDUCATIONAL ACHIEVEMENT TESTS

Achievement tests permit a wide range of potential uses. Practical applications of group achievement tests include the following:

• To identify children and adults with specific achievement deficits who might need more detailed assessment for learning disabilities

• To help parents recognize the academic strengths and weaknesses of their children and thereby foster individual remedial efforts at home

• To identify classwide or schoolwide achievement deficiencies as a basis for redirection of instructional efforts

• To appraise the success of educational programs by measuring the subsequent skill attainment of students

• To group students according to similar skill level in specific academic domains

• To identify the level of instruction that is appropriate for individual students

Thus, achievement tests serve institutional goals such as monitoring schoolwide achievement levels, but they also play an important role in the assessment of individual learning difficulties. As previously noted, different kinds of achievement tests are used to pursue these two fundamental applications (institutional and individual). Institutional goals are best served by group achievement test batteries, whereas individual assessment is commonly pursued with individual achievement tests (even though group tests may play a role here, too). Here we focus on group educational achievement tests.

Virtually every school system in the nation uses at least one educational achievement test, so it is not surprising that test publishers have responded to the widespread need by developing a panoply of excellent instruments.

In the following section, we describe several of the most widely used group standardized achievement tests. We limit our coverage here to three educational achievement tests, each distinctive in its own way. The Iowa Tests of Basic Skills (ITBS) is representative of the huge industry of standardized achievement testing used in virtually all school systems nationwide. The Metropolitan Achievement Test is of the same genre as the ITBS but embodies a new and powerful technique of reading assessment known as the Lexile approach and, thus, merits special attention. Finally, almost everyone has heard of the Tests of General Educational Development, known familiarly as the "GED." We would be remiss not to discuss this testing program.

Iowa Tests of Basic Skills (ITBS)

First published in 1935, the Iowa Tests of Basic Skills (ITBS) were most recently revised and restandardized in 2001. The ITBS is a multilevel battery of achievement tests that covers grades K through 8. A companion test, the Tests of Achievement and Proficiency (TAP), covers grades 9 through 12. In order to expedite direct and accurate comparisons of achievement and ability, the ITBS and the TAP were both concurrently normed with the Cognitive Abilities Test (CogAT), a respected group test of general intellectual ability.

The ITBS is available in several levels that correspond roughly with the ages of the potential examinees: levels 5–6 (grades K–1), levels 7–8 (grades 2–3), and levels 9–14 (grades 3–8).


The basic subtests for the older levels measure vocabulary, reading, language, mathematics, social studies, science, and sources of information (e.g., uses of maps and diagrams). A brief description of the subtests for grades 3–8 is provided in Table 6.3.

From the first edition onward, the ITBS has been guided by a pragmatic philosophy of educational measurement. The manual states the purpose of testing as follows:

The purpose of measurement is to provide information which can be used in improving instruction. Measurement has value to the extent that it results in better decisions which directly affect pupils.

To this end, the ITBS incorporates a criterion-referenced skills analysis to supplement the usual array of norm-referenced scores. For example, one feature available from the publisher's scoring service is item-level information. This information indicates topic areas, the items sampling each topic, and the correct or wrong response for each item. Teachers, therefore, have access to a wealth of diagnostic-instructional information for each student. Whether this information translates to better instruction—as the test authors desire—is very difficult to quantify. As Linn (1989) notes, "We must rely mostly on logic, anecdotes, and opinions when it comes to answering such questions."

The technical properties of the ITBS are beyond reproach. Historically, internal consistency and equivalent-form reliability coefficients are mostly in the mid-.80s to low .90s. Stability coefficients for a one-year interval are almost all in the .70 to .90 range. The test is free from overt racial and gender bias, as determined by content evaluation and item bias studies. The year 2000 norms for the test were empirically developed from large, representative national probability samples.

Item content of the ITBS is judged relevant by curriculum experts and reviewers, which speaks to the content validity of the test (Lane, 1992; Linn, 1989). Although the predictive validity of the latest ITBS has not been studied extensively, evidence from prior editions is very encouraging. For example, ITBS scores correlate moderately with high school grades (r's around .60). The ITBS is not a perfect instrument, but it represents the best that modern test development methods can produce.

TABLE 6.3 Brief Description of ITBS Subtests for Grades 3–8

Vocabulary: A word is presented in the context of a short phrase or sentence, and students select the correct meaning from multiple-choice alternatives.

Reading Comprehension: Students read a brief passage and answer multiple-choice questions that require inference or generalization.

Spelling: Each multiple-choice item presents four words, one of which may be misspelled, and a fifth option, no mistakes.

Capitalization: Test items require students to identify errors of under- or overcapitalization present in brief written passages.

Punctuation: Multiple-choice items require students to identify errors of punctuation involving commas, apostrophes, quotation marks, colons, and so on, or choose no mistakes.

Usage and Expression: In the first part, students identify errors in usage or expression; in the second part, students choose the best way to express an idea.

Math Concepts and Estimation: Questions deal with computation, algebra, geometry, measurement, and probability and statistics.

Math Problem Solving and Data Interpretation: Questions may involve multistep word problems or interpretation of tables and graphs.

Math Computation: These test items require the use of one arithmetic operation (addition, subtraction, multiplication, or division) with whole numbers, fractions, and decimals.

Social Studies: These questions involve aspects of history, geography, economics, and so on that are ordinarily covered in most school systems.

Science: These test items involve aspects of biology, ecology, space science, and physical sciences ordinarily covered in most school systems.

Maps and Diagrams: These questions evaluate the ability to use maps for a variety of purposes such as determining locations, directions, and distances.

Reference Materials: These questions measure the ability to use reference materials and library resources.


Metropolitan Achievement Test (MAT)

The Metropolitan Achievement Test dates back to 1930, when the test was designed to meet the curriculum assessment needs of New York City. The stated purpose of the MAT is "to measure the achievement of students in the major skill and content areas of the school curriculum." The MAT is concurrently normed with the Otis-Lennon School Ability Test (OLSAT).

Now in its eighth edition, the MAT is a multilevel battery designed for grades K through 12 and was most recently normed in 2000. The areas tested by the MAT include the traditional school-related skills: Reading, Mathematics, Language, Writing, Science, and Social Studies.

An attractive feature of the MAT is that student reading scores are reported as Lexile measures, a new and practical indicator of reading level. Lexile measures are likely to become a standard feature in most group achievement tests in the years ahead, so it is worth a brief detour to explain their nature and significance.

Lexile Measures

The Lexile approach is a major new improvement in the assessment of reading skill. It was developed over a span of more than 12 years using millions of dollars in grant funds from the National Institute of Child Health and Human Development (NICHD) (www.lexile.com). The Lexile approach is based on two simple, commonsense assumptions: (1) reading materials can be placed on a continuum of difficulty level (comprehensibility) and (2) readers can be ordered on a continuum of reading ability. The Lexile framework provides a common metric for matching readers and text, which, in turn, permits parents and educators to choose appropriate reading materials for children.

The Lexile scale is a true interval scale. The Lexile measure for a reading selection is a specific number indicating the reading demand of the text based on its semantic difficulty (vocabulary) and syntactic complexity (sentence length). Lexile measures for reading selections typically range from 200L to 1,700L (Lexiles). The Lexile score for a student, obtained from the Reading Comprehension test of the MAT or other achievement tests, is a precise index of the student's reading ability, calibrated on the same scale as the Lexile measure for text. The value of the Lexile approach is that student comprehension can be predicted as a function of the disparity between the demands of the text and the student's ability. For example, when readers are well targeted (the difference between text and reader is close to 0 Lexiles), research indicates that reader comprehension will be about 75 percent. When the text difficulty exceeds the reader's ability by 250L, comprehension drops to about 50 percent. When the skill of the reader exceeds the demands of the text by 250L, comprehension is about 90 percent (www.lexile.com).
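These three anchor points trace out an S-shaped relationship between the reader-text gap and expected comprehension. The function below is a logistic curve fitted to pass through exactly those anchors; the operational Lexile algorithm is proprietary, so this is an illustration of the idea rather than the published formula:

```python
import math

def forecast_comprehension(reader_lexile, text_lexile):
    """Forecast comprehension rate from the reader-text Lexile gap.

    The logistic curve is chosen to pass through the three anchor
    points cited above: gap 0 -> 75%, text harder by 250L -> 50%,
    reader stronger by 250L -> 90%.
    """
    gap = reader_lexile - text_lexile
    slope = 250 / math.log(3)          # ~227.6; fits all three anchors
    return 1 / (1 + math.exp(-(gap + 250) / slope))

print(round(forecast_comprehension(910, 910), 2))    # 0.75, well targeted
print(round(forecast_comprehension(660, 910), 2))    # 0.50, text too hard
print(round(forecast_comprehension(1160, 910), 2))   # 0.90, easy reading
```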

The Lexile approach has a number of potential benefits and applications for teachers and parents. Teachers can look up Lexile measures for specific books (the Lexile corporation has evaluated over 30,000 titles to date) as a way of building a library of titles at varying levels. Also, they can produce individualized reading lists suitable for each student. Likewise, parents can select well-matched books to read to their children. Stenner (2001) captures the allure of the Lexile approach as follows:

One of the great strengths of the Lexile Framework is the way it encourages thought about what forecasted comprehension rate would be optimal for different instructional contexts. Harry Potter and the Goblet of Fire is a 910L text. Readers at 400L to 500L can nonetheless enjoy listening to this story read aloud. A 700L reader could read the text in a one-on-one tutoring context. A 900L reader will disappear for an hour or two, fully capable of self-engaging with the text, and a 1600L adult reader can become so engrossed that a two-hour plane ride flies by.

The Lexile approach is not a panacea, but it is a major improvement in the assessment of reading skill.


Tests of General Educational Development (GED)

Another widely used achievement test battery is the Tests of General Educational Development (GED), developed by the American Council on Education and administered nationwide for high school equivalency certification (www.acenet.edu). The GED consists of multiple-choice examinations in five educational areas: Language Arts—Writing, Language Arts—Reading, Mathematics, Science, and Social Studies.

The Language Arts—Writing section also contains an essay question that examinees must answer in writing. The essay question is scored independently by two trained readers according to a 6-point holistic scoring method. The readers make a judgment about the essay based on its overall effectiveness in comparison to the effectiveness of other essays.

The GED comes in numerous alternate forms. Typically, internal consistency reliabilities for the subscales are above .90. However, the interrater reliability of scoring on the writing samples is more modest, typically between .6 and .7. These findings indicate that a liberal criterion for passing this subtest is appropriate so as to reduce decision errors. Regarding validity, the GED correlates very strongly (r = .77) with the graduation reading test used in New York (Whitney, Malizio, & Patience, 1985). Furthermore, the standards for passing the GED are more stringent than those employed by most high schools: currently, individuals who receive a passing score for a GED credential outperform at least 40 percent of graduating high school seniors (www.acenet.edu).

The GED emphasizes broad concepts rather than specific facts and details. In general, the purpose of the GED is to allow adults who did not graduate from high school to prove that they have obtained an equivalent level of knowledge from life experiences or independent study. Employers regard the GED as equivalent (if not superior) to earning a high school diploma. Successful performance on the GED enables individuals to apply to colleges, seek jobs, and request promotions that require a high school diploma as a prerequisite. Rogers (1992) provides an unusually thorough review of the GED.

Additional Group Standardized Achievement Tests

In addition to the previously described batteries, a few other widely used group standardized achievement tests deserve brief listing. These instruments are depicted in Table 6.4.

TABLE 6.4 Selected Group Achievement Tests for Elementary and Secondary School Assessment

Iowa Tests of Educational Development (ITED)

Designed for grades 9 through 12, the objective of this test battery is to measure the fundamental goals or generalized skills of education that are independent of the curriculum. Most of the test items require the synthesis of knowledge or a multiple-step solution.

Tests of Achievement and Proficiency (TAP)

This instrument is designed to provide a comprehensive appraisal of student progress toward traditional academic goals in grades 9 through 12. This test is co-normed with the ITED and the CogAT.

Stanford Achievement Test (SAchT)

Along with the ITBS, the SAchT is one of the leading contemporary achievement tests. Dating back more than 80 years and now in its tenth edition, it is administered to more than 15 million students every year.

TerraNova CTBS

For grades 1 through 12, this multi-level test combines multiple-choice questions with constructed response items that require students to produce correct answers, not just select them from alternatives.


TOPIC 6B Test Bias and Other Controversies

The Question of Test Bias

Case Exhibit 6.1 The Impact of Culture on Testing Bias

Social Values and Test Fairness

Genetic and Environmental Determinants of Intelligence

Origins and Trends in Racial IQ Differences

Age Changes in Intelligence

Generational Changes in IQ Scores

An intelligence test is a neutral, inconsequential tool until someone assigns significance to the results derived from it. Once meaning is attached to a person's test score, that individual will experience many repercussions, ranging from superficial to life-changing. These repercussions will be fair or prejudiced, helpful or harmful, appropriate or misguided—depending on the meaning attached to the test score.

Unfortunately, the tendency to imbue intelligence test scores with inaccurate and unwarranted connotations is rampant. Laypersons and students of psychology commonly stray into one thicket of harmful misconceptions after another. Test results are variously overinterpreted or underinterpreted, viewed by some as a divination of personal worth but devalued by others as trivial and unfair.

The purpose of this topic is to clarify further the meaning of intelligence test scores in the light of relevant behavioral research. We begin by dispelling a number of everyday misconceptions about IQ and then pursue several empirically based issues—some would say controversies—that bear on the meaning of intelligence test scores:

• The question of test bias
• Genetic and environmental effects on intelligence
• Origins of IQ differences between African Americans and Caucasian Americans
• The fate of intelligence in middle and old age
• Generational changes in intelligence test scores

The underlying theme of this section is that intelligence test scores are best understood within the framework of modern psychological research. The reader is warned that the research issues pursued here are complex, confusing, and occasionally contradictory. However, the rewards for grappling with these topics are substantial. After all, the meaning of intelligence tests is demarcated, sharpened, and refined entirely by empirical research.

THE QUESTION OF TEST BIAS

Beyond a doubt, no practice in modern psychology has been more assailed than psychological testing. Commentators reserve a special and often vehement condemnation for ability testing in particular. In his wide-ranging response to the hundreds of criticisms aimed at mental testing, Jensen (1980) concluded that test bias is the most common rallying point for the critics. In proclaiming test bias, the skeptics assert in various ways that tests are culturally and sexually biased so as to discriminate unfairly against racial and ethnic minorities, women, and the poor. We cite here a sampling of verbatim criticisms (Jensen, 1980):

• Intelligence tests are sadly misnamed because they were never intended to measure intelligence and might have been more aptly called CB (cultural background) tests.

• Persons from backgrounds other than the culture in which the test was developed will always be penalized.

• There are enormous social class differences in a child's access to the experiences necessary to acquire the valid intellectual skills.

• IQ scores reported for African Americans and low socioeconomic groups in the United States reflect characteristics of the test rather than of the test takers.


• The poor performance of African American children on conventional tests is due to the biased content of the tests; that is, the test material is drawn from outside the African American culture.

• Women are not so good as men at mathematics only because women have not taken as much math in high school and college.

Are these criticisms valid? The investigation of this question turns out to be considerably more complicated than the reader might suppose. A most important point is that appearances can be deceiving. As we will explain subsequently, the fact that test items "look" or "feel" preferential to one race, sex, or social class does not constitute proof of test bias. Test bias is an objective, empirical question, not a matter of personal judgment.

Although critics may be loath to admit it, dispassionate and objective methods for investigating test bias do exist. One purpose of this section is to present these methods to the reader. However, an aseptic discussion of regression equations and statistical definitions of test bias would be incomplete, only half of the story. Conceptions of test bias are irretrievably intermingled with notions of test fairness. A full explanation of the story surrounding the test-bias controversy requires that we investigate the related issue of test fairness, too.

Differences in terminology abound in this area, so it is important to set forth certain fundamental distinctions before proceeding. Test bias is a technical concept amenable to impartial analysis. The most salient methods for the objective assessment of test bias are discussed in the following pages. In contrast, test fairness reflects social values and philosophies of test use, particularly when test use extends to selection for privilege or employment. Much of the passion that surrounds the test-bias controversy stems from a failure to distinguish test bias from test fairness. To avoid confusion, it is crucial to draw a sharp distinction between these two concepts. We include separate discussions of test bias and test fairness, beginning with an analysis of why test bias is such a controversial topic.

The Test-Bias Controversy

The test-bias controversy has its origins in the observed differences in average IQ among various racial and ethnic groups. For example, African Americans score, on average, about 15 points lower than White Americans on standardized IQ tests. This difference reduces to 7 to 12 IQ points when socioeconomic disparities are taken into account. The existence of marked racial/ethnic differences in ability test scores has fanned the fires of controversy over test bias. After all, employment opportunities, admission to college, completion of a high school diploma, and assignment to special education classes are all governed, in part, by test results. Biased tests could perpetuate a legacy of racial discrimination. Test bias is deservedly a topic of intense scrutiny by both the public and the testing professions.

One possibility is that the observed IQ disparities indicate test bias rather than meaningful group differences. In fact, most laypersons and even some psychologists would regard the magnitude of race differences in IQ as prima facie evidence that intelligence tests are culturally biased. This is an appealing argument, but a large difference between defined subpopulations is not a sufficient basis for proving test bias. The proof of test bias must rest on other criteria outlined in the following section.

When do test score differences between groups signify test bias? We begin by reviewing the criteria that should be used to investigate test bias of any kind, whether for race, gender, or any other defining characteristic.

Criteria of Test Bias and Test Fairness

The topic of test bias has received wide attention from measurement psychologists, test developers, journalists, test critics, legislators, and the courts. Cole and Moss (1998) underscore an unsettling consequence of the proliferation of views held on this topic, namely, that concepts of test bias have become increasingly intricate and complex. Furthermore, the understanding of test bias is made difficult by the implicit and often emotional assumptions—held even by scholars—that may lead honest persons to view the same information in different ways.

In part, disagreements about test bias are perpetuated because adversaries in this debate fail to clarify essential terminology.


Too often, terms such as test bias and test fairness are considered interchangeable and thrown about loosely without definition. We propose that test bias and test fairness commonly refer to markedly different aspects of the test-bias debate. Careful examination of both concepts will provide a basis for a more reasoned discussion of this controversial topic.

As interpreted by most authorities in this field, test bias refers to objective statistical indices that examine the patterning of test scores for relevant subpopulations. Although experts might disagree about nuances, on the whole there is a consensus about the statistical criteria that indicate when a test is biased. We will expand this point later, but we can provide the reader with a brief preview here: in general, a test is deemed biased if it is differentially valid for different subgroups. For example, a test would be considered biased if the scores from appropriate subpopulations did not fall on the same regression line for a relevant criterion.
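The regression criterion is straightforward to check empirically: fit the test-criterion regression separately for each subgroup and compare the resulting slopes and intercepts. A minimal sketch with simulated data (all numbers below are invented for illustration; they depict the unbiased case, where both groups share one regression line despite a mean difference):

```python
import numpy as np

rng = np.random.default_rng(42)

# Both groups follow the SAME test-criterion regression line
# (criterion = 0.02 * test + 1.5, plus noise), so the test is
# unbiased in the regression sense even though Group B's mean
# test score is lower than Group A's.
for label, group_mean in [("Group A", 100), ("Group B", 90)]:
    test = rng.normal(group_mean, 15, 500)
    criterion = 0.02 * test + 1.5 + rng.normal(0, 0.4, 500)
    slope, intercept = np.polyfit(test, criterion, 1)
    print(f"{label}: slope={slope:.3f}, intercept={intercept:.2f}")

# Comparable slopes and intercepts across groups indicate homogeneous
# regression; systematic differences in either would signal bias.
```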

In contrast to the narrow concept of test bias, test fairness is a broad concept that recognizes the importance of social values in test usage. Even a test that is unbiased according to the traditional technical criterion of homogeneous regression might still be deemed unfair because of the social consequences of using it for selection decisions. The crux of the debate is this: test bias (a statistical concept) is not necessarily the same thing as test fairness (a values concept). Ultimately, test fairness is based on social conceptions such as one's image of a just society. In the assessment of test fairness, subjective values are of overarching importance; the statistical criteria of test bias are merely ancillary. We will return to this point later when we analyze the link between social values and test fairness. But let us begin with a traditional presentation of technical criteria for test bias.

The Technical Meaning of Test Bias: A Definition

One useful way to examine test bias is from the technical perspective of test validation. The reader will recall from an earlier chapter that a test is valid when a variety of evidence supports its utility and when inferences derived from it are appropriate, meaningful, and useful. One implication of this viewpoint is that test bias can be equated with differential validity for different groups:

Bias is present when a test score has meanings or implications for a relevant, definable subgroup of test takers that are different from the meanings or implications for the remainder of the test takers. Thus, bias is differential validity of a given interpretation of a test score for any definable, relevant subgroup of test takers. (Cole & Moss, 1998)

Perhaps a concrete example will help clarify this definition. Suppose a simple word-problem arithmetic test were used to measure youngsters' addition skills. The problems might be of the form "If you have two six-packs of pop, how many cans do you have altogether?" Suppose, however, the test is used in a group of primarily Spanish-speaking seventh graders. With these children, low scores might indicate a language barrier, not a problem with arithmetic skills. In contrast, for English-speaking children low scores would most likely indicate a deficit in arithmetic skills. In this example, the test has differential validity, predicting arithmetic deficits quite well for English-speaking children but very poorly for Spanish-speaking children. According to the technical perspective of test validation, we would conclude that the test is biased.

Although the general definition of test bias refers to differential validity, in practice the particular criteria of test bias fall under three main headings: content validity, criterion-related validity, and construct validity. We will review each of these categories, discussing relevant findings along the way. The coverage is illustrative, not exhaustive. Interested readers should consult Jensen (1980), Cole and Moss (1998), and Reynolds and Brown (1984b).

Bias in Content Validity

Bias in content validity is probably the most common criticism of those who denounce the use of standardized tests with minorities (Helms, 1992; Hilliard, 1984; Kwate, 2001). Typically, critics rely on their own expert judgment when they expound one or more of the following criticisms of the content validity of ability tests:


of intelligence into one of three categories: least cul- tural, neutral, most cultural. McGurk administered these test items to hundreds of high school students. His primary analysis involved the test results for 213 African American students and 213 White students matched for curriculum, school, length of enrollment, and socio economic background.

McGurk (1953a, 1953b) discovered that the mean difference between African American and White students for the total hybrid test, expressed in standard deviation units, was .50. More pertinent to the topic of test bias in content validity was his comparison of scores on the 37 "most cultural" items versus the 37 "least cultural" items. For the "most cultural" items—the ones nominated by the judges as highly culturally biased—the difference was .30. For the "least cultural" items—the ones judged to be more fair to African Americans and other cultural minorities—the difference was .58. In other words, the items nominated as most cultural were relatively easier for African Americans; the items nominated as least cultural were relatively harder. This finding held true even after item difficulty was partialed out. Furthermore, the item difficulties for the two groups were almost perfectly correlated (r = .98 for "most cultural" and r = .96 for "least cultural" items). There is an important lesson here that test critics often overlook: "Expert" judges cannot identify culturally biased test items based on an analysis of item characteristics. Recent studies continue to reaffirm this conclusion (Reynolds, Lowe, & Saenz, 1999).

In general, with respect to well-known standardized tests of ability and aptitude, research has not supported the popular belief that the specific content of test items is a source of cultural bias against minorities. This conclusion does not exonerate these tests with respect to other criteria of test bias, discussed in the following sections. Furthermore, we can point out that savvy test developers should remain vigilant even about the appearance of bias in test content, since the appearance of unfairness can affect public attitudes about psychological tests in quite tangible ways.

Bias in Predictive or Criterion-Related Validity

The prediction of future performance is one important use of intelligence, ability, and aptitude tests.

on their own expert judgment when they expound one or more of the following criticisms of the content validity of ability tests:

1. The items ask for information that ethnic minority or disadvantaged persons have not had equal opportunity to learn.

2. The scoring of the items is improper, since the test author has arbitrarily decided on the only correct answer and ethnic minorities are inappropriately penalized for giving answers that would be correct in their own culture but not that of the test maker.

3. The wording of the questions is unfamiliar, and an ethnic minority person who may "know" the correct answer may not be able to respond because he or she does not understand the question (Reynolds, 1998).

Any of these criticisms, if accurate, would constitute bona fide evidence of test bias. However, merely stating a criticism does not prove the point. Where these criticisms fall short is that they are seldom buttressed by empirical evidence.

Reynolds (1998) has offered a definition of content bias for aptitude tests that addresses the preceding points in empirically defined, testable terms:

An item or subscale of a test is considered to be biased in content when it is demonstrated to be relatively more difficult for members of one group than another when the general ability level of the groups being compared is held constant and no reasonable theoretical rationale exists to explain group differences on the item (or subscale) in question.

This definition is useful because it proposes an empirical approach to the question of test bias.
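
In modern practice, definitions of this kind are operationalized through differential item functioning (DIF) analyses. As a minimal sketch in Python (our illustrative choice of technique, not a procedure prescribed by Reynolds; all data are hypothetical), the Mantel-Haenszel statistic compares passing rates for two groups after matching examinees on total test score:

import numpy as np

def mantel_haenszel_or(item_correct, group, total_score):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    item_correct : 0/1 array (1 = examinee passed the item)
    group        : 0/1 array (0 = reference group, 1 = focal group)
    total_score  : total test score, used to match examinees on ability

    A value near 1.0 suggests the item is about equally difficult for
    ability-matched members of the two groups (no DIF in this sense).
    """
    item_correct = np.asarray(item_correct, dtype=bool)
    group = np.asarray(group)
    total_score = np.asarray(total_score)

    num = den = 0.0
    for s in np.unique(total_score):        # one stratum per ability level
        idx = total_score == s
        passed, focal = item_correct[idx], group[idx] == 1
        t = idx.sum()
        a = np.sum(passed & ~focal)         # reference examinees who passed
        b = np.sum(~passed & ~focal)        # reference examinees who failed
        c = np.sum(passed & focal)          # focal examinees who passed
        d = np.sum(~passed & focal)         # focal examinees who failed
        num += a * d / t
        den += b * c / t
    return num / den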

In general, attempts to prove that expert-nominated items are culturally biased have not yielded the conclusive evidence that critics expect. McGurk (1953a, 1953b, 1975) has written extensively on this topic, and we will use his classic study to illustrate this point. For his doctoral dissertation, McGurk asked a panel of 78 judges (professors, educators, and graduate students in psychology and sociology) to classify each of 226 items from well-known standardized tests


modeled after Cleary, Humphreys, Kendrick, and Wesman (1975).

Suppose we are using a scholastic aptitude test to predict first-year grade point average (GPA) in college. In the case of a simple regression analysis, prediction of future performance is made from an equation of the form:

Y = bX + a

where Y is the predicted college GPA, X is the score on the aptitude test, and b and a are constants derived from a statistical analysis of test scores and grades of prior students. We will not concern ourselves with how b and a are derived; the reader can find this information in any elementary statistics textbook.
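
For readers who wish to see the computation, b and a are simply the least-squares slope and intercept. A minimal sketch in Python, with invented records standing in for the test scores and grades of prior students:

import numpy as np

# Hypothetical records of prior students (illustrative values only)
test_scores = np.array([420, 480, 510, 560, 600, 650, 700])   # X
gpas        = np.array([2.1, 2.4, 2.6, 2.9, 3.0, 3.4, 3.6])   # Y

# np.polyfit returns the least-squares slope b and intercept a
b, a = np.polyfit(test_scores, gpas, deg=1)

# Predicted GPA for a new applicant scoring 580
predicted_gpa = b * 580 + a
print(f"Y = {b:.4f}X + {a:.2f}; predicted GPA = {predicted_gpa:.2f}")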

The values of b and a correspond to important aspects of the regression line—the straight line that facilitates the most accurate prediction of the criterion (college grades) from the predictor (aptitude score) (Figure 6.7). In particular, b corresponds to the slope of the line; higher values of b indicate that predicted grades rise more steeply with each additional point on the test. The value of a depicts the intercept on the vertical axis. The units of measurement for b and a cannot

For this application of psychological testing, predictive validity is the most crucial form of validity in relation to test bias. In general, an unbiased test will predict future performance equally well for persons from different subpopulations. For example, an unbiased scholastic aptitude test will predict future academic performance of African Americans and White Americans with near-identical accuracy.

Reynolds (1998) offers a clear, direct definition of test bias with regard to criterion-related or predictive validity:

A test is considered biased with respect to predictive validity if the inference drawn from the test score is not made with the smallest feasible random error or if there is constant error in an inference or prediction as a function of membership in a particular group.

This definition of test bias invokes what might be referred to as the criterion of homogeneous regression. According to this viewpoint, a test is unbiased if the results for all relevant subpopulations cluster equally well around a single regression line. In order to clarify this point, we need to introduce concepts relevant to simple regression. The discussion is

FIGURE 6.7 Test Scores, Grades, and Regression Line for a Hypothetical Large Group of College Students. (Vertical axis: College Grade Point Average, 0 to 4.0; horizontal axis: Score on an Aptitude Test, 200 to 800.) Note: The dotted line shows how the regression line can be used to predict grade point average from the test score for a single, new subject.


scores for group A would be overpredicted, whereas criterion scores for group B would be underpredicted. Thus, the use of a single regression line would constitute a clear instance of test bias, because the test has differential predictive validity for different subgroups.1 This is referred to as intercept bias because the Y-axis intercept is different for the two groups.

But what about using separate regression lines for each subgroup? Would this solve the problem and rescue the test from criterion-related test bias? Opinions differ on this point. Although there is no doubt that separate regression equations would maximize predictive accuracy for the combined sample, whether this practice would produce test fairness is debated. We return to this issue later, when we discuss the relevance of social values to test fairness.

The Scholastic Aptitude Test (now known as the Scholastic Assessment Test and discussed in a later chapter) has been analyzed by several researchers with regard to test bias in criterion-related validity (Cleary, Humphreys, Kendrick, & Wesman,

be specified in advance because they depend on the underlying scales used for X and Y. Notice in Figure 6.7 that the regression line is the reference for predicting grades from observed aptitude score.

According to the criterion of homogeneous regression, in an unbiased test a single regression line can predict performance equally well for all relevant subpopulations, even though the means for the different groups might differ. For example, in Figure 6.8 group A performs better than group B on both predictor and criterion. Yet, the relationship between aptitude score and grades is the same for both groups. In this hypothetical instance, the graph depicts the absence of bias on the aptitude test with respect to criterion-related validity.

A more complicated situation known as intercept bias is shown in Figure 6.9. In this case, scores for the two groups do not cluster tightly around the single best regression line shown as a dotted line in the graph. Separate, parallel regression lines (and, therefore, separate regression equations) would be needed to facilitate accurate prediction. If a single regression line were used (the dotted line), criterion

FIGURE 6.8 Test Scores, Grades, and Single Regression Line for Two Hypothetical Large Subpopulations (A and B) of College Students. (Vertical axis: College Grade Point Average, 0 to 4.0; horizontal axis: Score on an Aptitude Test, 200 to 800.)

1. Contrary to widely held belief, test bias in these cases actually favors the lower-scoring group because its performance on the criterion is overpredicted. On occasion, then, test bias can favor minority groups.


dotted line) for prediction might, therefore, result in both under- and overprediction of scores for selected subjects in both groups. Professional opinion would be unanimous in this case: This test possesses a high degree of test bias in criterion-related validity.

Bias in Construct Validity

The reader will recall that the construct validity of a psychological test can be documented by diverse forms of evidence, including appropriate developmental patterns in test scores, theory-consistent intervention changes in test scores, and confirmatory factor analysis. Because construct validity is such a broad concept, the definition of bias in construct

validity requires a general statement amenable to research from a variety of viewpoints with a broad range of methods. Reynolds (1998) offers the following definition:

Bias exists in regard to construct validity when a test is shown to measure different hypothetical traits (psychological constructs) for one group than for another; that is, differing interpretations of a common performance are shown to be appropriate as a function of ethnicity, gender, or another variable of interest, one typically but not necessarily nominal.

1975; Manning & Jackson, 1984). A consistent finding is that separate, parallel regression lines are needed for African American and White examinees. For example, in one school the best regression equations for African American, White, and combined students were as follows:

African American: Y = .055 + .0024V + .0025M

White: Y = .652 + .0026V + .0011M

Combined: Y = .586 + .0027V + .0012M

where Y is the predicted college grade point, V is the SAT Verbal score, and M is the SAT Mathematics score (Cleary et al., 1975, p. 29). The effect of using the White or the combined formula is to overpredict college grades for African American subjects based on SAT results. On the traditional four-point scale (A = 4, B = 3, etc.), the average amount of overprediction from 17 separate studies was .20 or one-fifth of a grade point (Manning & Jackson, 1984). What these results mean is open to debate, but it seems clear, at least, that the SAT and similar entrance examinations do not underpredict college grades for minorities.
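
The arithmetic behind this overprediction claim can be checked directly from the equations above. A brief sketch in Python (the applicant's V and M scores are invented for illustration):

# Regression equations reported by Cleary et al. (1975, p. 29) for one school;
# V and M are SAT Verbal and Mathematics scores.
def gpa_african_american(V, M):
    return .055 + .0024 * V + .0025 * M

def gpa_combined(V, M):
    return .586 + .0027 * V + .0012 * M

V, M = 400, 400   # hypothetical applicant (invented scores)
own_group = gpa_african_american(V, M)   # prediction from own-group line
combined  = gpa_combined(V, M)           # prediction from pooled line

# For these invented scores, the pooled equation predicts a higher GPA
# than the group-specific equation, i.e., it overpredicts performance.
print(f"own-group: {own_group:.2f}, combined: {combined:.2f}, "
      f"overprediction: {combined - own_group:.2f}")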

The most peculiar regression outcome, known as slope bias, is depicted in Figure 6.10. In this case, the regression lines for separate subgroups are not even parallel. Using a single regression line (the

FIGURE 6.9 Test Scores, Grades, and Parallel Regression Lines for Two Hypothetical Large Subpopulations (A and B) of College Students. (Vertical axis: College Grade Point Average, 0 to 4.0; horizontal axis: Score on an Aptitude Test, 200 to 800.)


of studies (Scheuneman, 1987; Gutkin & Reynolds, 1981; Johnston & Bolen, 1984), research in this area is more notable for its consistent findings with respect to factorial invariance across subgroups (e.g., Geary & Whitworth, 1988).

A second criterion of nonbias in construct validity is that the rank order of item difficulties within a test should be highly similar for relevant subpopulations. Since age is a major determinant of item difficulty, this standard is usually checked separately for each age group covered by a test. The reader should note what this criterion does not specify. It does not specify that relevant subgroups must obtain equivalent passing rates for test items. What is essential is that the items that are the most difficult (or least difficult) for one subgroup should be the most difficult (or least difficult) for other relevant subpopulations.

The criterion of similar rank order of item difficulties can be tested in a very straightforward and objective manner. If the difficulty level of each item is computed by means of the p value (percentage passing) for each relevant subpopulation, then it is possible to compare the relative item difficulties across same-aged subgroups. In fact, the similarity of the rank order of item difficulties for any two

From a practical standpoint, two straightforward criteria for nonbias flow from this definition (Reynolds & Brown, 1984a). If a test is nonbiased, then comparisons across relevant subpopulations should reveal a high degree of similarity for (1) the factorial structure of the test and (2) the rank order of item difficulties within the test. Let us examine these criteria in more detail.

An essential criterion of nonbias is that the factor structure of test scores should remain invariant across relevant subpopulations. Of course, even within the same subgroup, the factor structure of a test might differ between age groups, so it is important that we restrict our comparison to same-aged persons from relevant subpopulations. For same-aged subjects, a nonbiased test will possess the same factor structure across subgroups. In particular, for a nonbiased test the number of emergent factors and the factor loadings for items or subscales will be highly similar for relevant subpopulations.
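
Investigators quantify this similarity of loadings in several ways; one common index (our illustrative choice, not one named in the sources cited here) is Tucker's congruence coefficient, computed between the loadings obtained separately in each subgroup. A minimal sketch with hypothetical loadings:

import numpy as np

def tucker_congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors.

    Values close to 1.0 indicate that the factor is defined by
    essentially the same pattern of loadings in both subgroups.
    """
    a, b = np.asarray(loadings_a), np.asarray(loadings_b)
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

# Hypothetical loadings of five subtests on a verbal factor,
# factor-analyzed separately in two subpopulations
group1 = [0.78, 0.71, 0.65, 0.60, 0.55]
group2 = [0.75, 0.73, 0.62, 0.63, 0.52]
print(f"congruence = {tucker_congruence(group1, group2):.3f}")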

In general, when the items or subscales of prominent ability and aptitude tests are factor-analyzed separately in White and minority samples, the same factors emerge in the relevant subpopulations (Reynolds, 1982; Jensen, 1980, 1984). Although minor anomalies have been reported in a handful

FIGURE 6.10 Test Scores, Grades, and Nonparallel Regression Lines for Two Hypothetical Large Subpopulations (A and B) of College Students. (Vertical axis: College Grade Point Average, 0 to 4.0; horizontal axis: Score on an Aptitude Test, 200 to 800.)


11 through 15 did not show a smooth decline, as would be found in a nonbiased test:

Item Number    Percent Passing
Item 11        81
Item 12        61
Item 13        16
Item 14        45
Item 15        31

Item 13 reveals clear evidence of bias in construct validity—it is substantially more difficult than the preceding and following items. We cannot reveal the content of these copyrighted test items. However, we can say that item 13 requires the child to know about a well-known Italian explorer who reputedly discovered America. Actually, which foreigners first landed on American shores is a matter of dispute—but that is another issue (Menzies, 2003). What is clear in this case is that item 13 on the WISC-III Information subtest requires knowledge that is unpalatable to most Native American examinees. The explorer in question is not a revered figure in this subculture. As Gregory (2009) notes:

We can well imagine the confusion of these indigenous people who have been on this continent for many thousands of years trying to fathom the notion that a European "discovered" their land.

In fairness, we should mention that clear examples of psychometrically confirmed test bias such as this are not common in the published literature. Even so, this example serves as a reminder that ongoing investigations of test bias are still needed.

Reprise on Test Bias

Critics who hypothesize that tests are biased against minorities assert that the test scores underestimate the ability of minority members. As we have argued in the preceding sections, the hypothesis of test bias is a scientific question that can be answered empirically through such procedures as factor analysis, regression equations, intergroup comparisons of the

groups can be gauged objectively by means of a correlation coefficient (rxy). The paired p values for the test items constitute the values of x and y used in the computation. The closer the value of r to 1.00, the more similar the rank ordering of item difficulties for the two groups.
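
A minimal sketch of this computation in Python, using invented p values for ten items:

import numpy as np

# Hypothetical percentage-passing (p) values for ten items,
# computed separately within two same-aged subgroups
p_group1 = np.array([.95, .90, .82, .74, .66, .58, .47, .35, .22, .10])
p_group2 = np.array([.92, .88, .80, .70, .69, .55, .44, .38, .20, .12])

# Pearson correlation of the paired p values; values near 1.00
# indicate a highly similar ordering of item difficulties
r = np.corrcoef(p_group1, p_group2)[0, 1]
print(f"cross-group correlation of item difficulties: r = {r:.2f}")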

In general, cross-group comparisons of relative item difficulties for prominent aptitude and ability tests have yielded correlations bordering on 1.00; that is, most tests show extremely similar rank orderings for item difficulties across relevant subpopulations (Jensen, 1980; Reynolds, 1982). In a representative study, Miele (1979) investigated the relative item difficulties of the WISC for African American and White subjects at each of four grade levels (preschool, first, third, and fifth grades). He found that the average cross-racial correlation (holding grade level constant) for WISC item p values was .96 for males and .95 for females. These values were hardly different from the cross-sex correlations (holding grade level constant) within race, which were .98 (Whites) and .97 (African Americans). As noted, these findings are not unusual.

In general, for mainstream cognitive tests, the rank order of item difficulties is nearly identical for relevant subpopulations, including minority groups. However, some exceptions have been noted. For example, Urquhart-Hagie, Gallipo, and Svien (2003) report some striking examples of apparent item bias in a WISC-III study of 28 teenage children on the Lakota Sioux reservation in South Dakota. These authors computed the passing rates for the WISC-III subtest items and found dramatic deviations in the relative difficulty levels of consecutive items on a few of the subtests. For example, consider the Information subtest, which consists of 30 items ranked from very easy (nearly 100% passing rate) to very hard (less than 1% passing rate). These items evaluate the child's fund of basic information, with questions on a par with "How many legs does a cat have?" (easy) or "Which continent includes Argentina?" (medium) or "Who is the Dalai Lama?" (hard). The problem noted by Urquhart-Hagie et al. (2003) on the Information subtest is that item 13 was passed at a substantially lower rate than expected. Specifically, the percentage of the sample passing items


that potential bias does not reside solely within the qualities of the testing instrument. Bias can arise within the complexities of clinical interactions, especially when cultural differences exist between the practitioner and the client. The choice of a test and the timing of its application may impact the validity of the results, as we illustrate in Case Exhibit 6.1.

CASE EXHIBIT 6.1

The Impact of Culture on Testing Bias

The most commonly used tests of cognitive functioning come from the United States or western European nations. These instruments embody a Western perspective, with a focus on skills valued in urban and industrial settings (Poortinga & Van de Vijver, 2004). But culture impacts more than just test content; culture also shapes our understanding of the assessment process itself. For example, most Westerners recognize that the purpose of consultation with a health care professional is to convey useful information to the practitioner. They know that the practitioner will conduct needed tests or procedures to help identify appropriate interventions. An implicit social contract guides the understanding of all parties.

But not every culture has the same understanding of this practitioner–patient covenant. We consider here the case of Mr. Kim, a 70-year-old man brought to a Latina psychologist by his daughter (Hayes, 2008). Mr. Kim was a second-generation Korean referred by his physician because of concerns about "memory loss." The psychologist—we will call her Dr. Santiago—met initially with Mr. Kim and his adult daughter, Insook. The daughter seemed thoroughly acculturated to the United States, readily offering her thoughts. In contrast, Mr. Kim seemed more traditionally Korean, spoke rarely, and then in a low voice with a slight accent. He seldom made eye contact. Dr. Santiago made the cultural mistake of beginning the consultation by directing questions to Insook, an affront to Mr. Kim. In many Asian cultures, elderly persons expect to be treated with dignity and reverence, especially by their children (Kim, Kim, & Rue, 1997). The psychologist sensed that something was amiss,

difficulty levels for “biased” versus “unbiased” items, and rank ordering of item difficulties. In general, most investigators have found by these criteria that major ability and aptitude tests lack bias (Jensen, 1980; Reynolds, 1994a; Kuncel & Sackett, 2007; Sackett, Borneman, & Connelly, 2008).

Recently, however, Aguinis, Culpepper, and Pierce (2010) have called into question the prevailing wisdom, using a complex statistical simulation to demonstrate that tests of bias are themselves biased. Their method, called Monte Carlo simulation, is beyond the scope of coverage here. They deduced that most studies of slope bias (rarely found in bias studies) do not possess sufficient statistical power to detect it. As noted earlier, slope bias results in the overprediction and underprediction of minority performance at different levels of the predictor variable. They also conclude that most studies of intercept bias (often found in bias studies, favoring minorities) are the result of a complex statistical artifact. Intercept bias is the systematic overprediction of scores for one group at all levels of the predictor variable. They conclude:

We are aware that we have set a tall-order goal of reviving research on test bias in preemployment testing in the face of established conclusions in the fields of I/O psychology, management, and others concerned with high-stakes testing. Our results indicate that the accepted procedure to assess test bias is itself biased: Slope-based bias is likely to go undetected and intercept-based bias favoring minority group members is likely to be found when in fact it does not exist (Aguinis et al., 2010, p. 653).

The authors call for a renewal of interest in research on test bias in high-stakes testing and suggest methods to improve research in this area, including the use of power analysis to determine sample sizes needed for valid inferences about differential prediction.
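
In applied work, intercept and slope bias are typically tested together with moderated multiple regression, entering a group indicator and a group-by-score interaction; this is the general procedure whose statistical power Aguinis et al. (2010) call into question. A schematic sketch with simulated data (assuming the statsmodels library is available; all values are invented):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated applicants: aptitude score X, group indicator G, criterion Y.
# Here Y follows the same line in both groups, i.e., an unbiased test.
n = 400
X = rng.normal(500, 100, n)
G = rng.integers(0, 2, n)                     # 0 = majority, 1 = minority
Y = 0.004 * X + 0.5 + rng.normal(0, 0.4, n)

# Moderated multiple regression: x1 = X, x2 = G (tests intercept bias),
# x3 = X * G (tests slope bias). Nonsignificant G and X*G terms are
# consistent with a single regression line for both groups.
design = sm.add_constant(np.column_stack([X, G, X * G]))
print(sm.OLS(Y, design).fit().summary())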

Analyses of test bias focus mainly on the statistical properties of selected instruments, looking for differential validity in the application of tests with minority examinees. But it is good to remember


Mr. Kim’s demeanor is not uncommon among people of Korean and Buddhist cultures, for whom emotional restraint is often seen as a sign of maturity and problems are considered a fact of life (p. 145).

Sometimes choosing not to administer an ostensibly suitable test is the proper course of action, the necessary antidote to bias in testing.

We turn now to the broader concept of test fairness. How well do existing instruments meet reasonable criteria of test fairness? As the reader will learn, test fairness involves social values and is, therefore, an altogether more debatable—and more debated—topic than test bias.

SOCIAL VALUES AND TEST FAIRNESS

Even an unbiased test might still be deemed unfair because of the social consequences of using it for selection decisions. In contrast to the narrow, objective notion of test bias, the concept of test fairness incorporates social values and philosophies of test use. We will demonstrate to the reader that, in the final analysis, the proper application of psychological tests is essentially an ethical conclusion that cannot be established on objective grounds alone.

In a classic article that deserves detailed scrutiny, Hunter and Schmidt (1976) proposed the first clear distinction between statistical definitions of test bias and social conceptions of test fairness. Although the authors reviewed the usual technical criteria of test bias with incisive precision, their article is most famous for its description of three mutually incompatible ethical positions that can and should affect test use.

Hunter and Schmidt (1976) noted that psychological tests are often used for institutional selection procedures such as employment or college admission. In this context, the application of test results must be guided by a philosophy of selection. Unfortunately, in many institutions the selection philosophy is implicit, not explicit. Nonetheless, when underlying values are made explicit, three

and switched to interviewing Mr. Kim directly. She asked if he experienced memory difficulties. He responded in a barely audible voice that he noticed "some" but that his daughter was "too bothered."

At this point in the consultation, many psychologists would wonder if Mr. Kim was experiencing the onset of dementia. Typically, the practitioner might want to assess the mental status of the patient, perhaps using a test with good sensitivity and specificity like the Mini-Mental State Exam (Folstein, Folstein, & McHugh, 1975). This is a simple measure with 30 scorable items of orientation, memory, and other cognitive skills. It is so easy that normal adults score in the range of 27 to 30 points. But Dr. Santiago resisted the temptation to jump straight into testing, recognizing that Mr. Kim likely would be further alienated and perform poorly for cultural reasons, regardless of his cognitive status.

Instead of administering a test that would yield invalid and biased results, the psychologist chose to offer tea to Mr. Kim and his daughter. Afterward, she engaged Mr. Kim alone in a socially oriented conversation about his extended family, looking for signs of cognitive impairment such as word-finding problems, confusion, or difficulty staying on topic. Within this relaxed atmosphere, a better picture of his performance emerged. His cognitive slips were minor, yet his mood conveyed deep and abiding sadness. Dr. Santiago suspected that Mr. Kim suffered from depression, which can cause significant cognitive impairment, especially in the elderly (Reppermund, Brodaty, Crawford, and others, 2011). She offered no conclusions from this first consultation, but left the door open for further assessment of Mr. Kim. In the meantime, she planned to confer with an experienced Korean American psychologist.

An important lesson from this case is that the cultural background of the patient impacts the suitability, validity, and bias of assessment methods. An instrument appropriate in one context may yield invalid, biased results in a different cultural milieu. Hayes (2008) concluded that

the psychologist initially misinterpreted the father's emotional restraint, lesser eye contact, and apparent acceptance of his difficulties as signs of dementia. She later learned that


By definition, fair share quotas are based initially upon population percentages. Within relevant subpopulations, factors that predict future performance such as test scores would then be considered. However, one consequence of quotas is that those selected do not necessarily have the highest scores on the predictor test.

Qualified Individualism

Qualified individualism is a radical variant of individualism:

This position notes that America is constitutionally opposed to discrimination on the basis of race, religion, national origin, or sex. A qualified individualist interprets this as an ethical imperative to refuse to use race, sex, and so on, as a predictor even if it were in fact scientifically valid to do so. (Hunter & Schmidt, 1976)

For selection purposes, the qualified individualist would rely exclusively on tested abilities, without reference to age, sex, race, or other demographic characteristics. This seems laudable, but examine the potential consequences. Suppose a qualified individualist used SAT scores for purposes of college admission. Even though SAT scores for African Americans and Whites produce separate regression lines for the criterion of college grades, the qualified individualist would be ethically bound to use the single, less-accurate regression line derived for the entire sample of applicants. As a consequence, the future performance of African Americans would be overpredicted, which would seemingly boost the proportion of persons selected from this applicant group. With respect to selection ratios, the practical impact of qualified individualism is therefore midway between quotas and unqualified individualism.

Reprise on Test Fairness

Which philosophy of selection is correct? The truth is, this problem is beyond the scope of rational solution. At one time or another, each of the ethical stances outlined previously has been championed by wise, respected, and thoughtful citizens. However,

ethical positions can be distinguished. These positions are unqualified individualism, quotas, and qualified individualism. Since these ethical stances are at the very core of public concerns about test fairness, we will review these positions in some detail.

Unqualified Individualism

In the American tradition of free and open competition, the ethical stance of unqualified individualism dictates that, without exception, the best qualified candidates should be selected for employment, admission, or other privilege. Hunter and Schmidt (1976) spell out the implications of this position:

Couched in the language of institutional selection procedures, this means that an organization should use whatever information it possesses to make a scientifically valid prediction of each individual's performance and always select those with the highest predicted performance. This position looks appealing at first glance, but embraces some implications that most persons find troublesome. In particular, if race, sex, or ethnic group membership contributed to valid prediction of performance in a given situation over and above the contributions of test scores, then those who espouse unqualified individualism would be ethically bound to use such a predictor.

Quotas

The ethical stance of quotas acknowledges that many bureaucracies and educational institutions owe their very existence to the city or state in which they function. Since they exist at the will of the people, it can be argued that these institutions are ethically bound to act in a manner that is "politically appropriate" to their location. The logical consequence of this position is quotas. For example, in a location whose population is one-third African American and two-thirds White, selection procedures should admit candidates in approximately the same ratio. A selection procedure that deviates consistently from this standard would be considered unfair.


Of course, the demonstration of substantial genetic influence for a trait does not imply that heredity alone is responsible for differences between individuals—environmental factors are formative, too, as reviewed subsequently.

The genetic contribution to human characteristics such as intelligence (as measured by IQ tests) is usually measured in terms of a heritability index that can vary from 0.0 to 1.0. The heritability index is an estimate of how much of the total variance in a given trait is due to genetic factors. Heritability of 0.0 means that genetic factors make no contribution to the variance in a trait, whereas heritability of 1.0 means that genetic factors are exclusively responsible for the variance in a trait. Of course, for most measurable characteristics, heritability is somewhere between the two extremes. McGue et al. (1993) discuss the various methods for computing heritability based on twin and adoption studies.
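
As a single illustration of these methods (one classical shortcut, not the only approach McGue et al. review), Falconer's formula estimates heritability by doubling the difference between identical-twin and fraternal-twin correlations:

def falconer_heritability(r_mz, r_dz):
    """Classical twin-study estimate: h2 = 2 * (r_MZ - r_DZ)."""
    return 2 * (r_mz - r_dz)

# Illustrative twin correlations in the range often reported for IQ
print(falconer_heritability(r_mz=0.86, r_dz=0.60))   # -> 0.52, near .50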

It is important to stress that heritability is a population statistic that cannot be extended to explain an individual score. Furthermore, heritability for a given trait is not a constant. As Jensen (1969) notes, estimates of heritability "are specific to the population sampled, the point in time, how the measurements were made, and the particular test used to obtain the measurements." For IQ, most studies report heritability estimates right around .50, meaning that about half of the variability in IQ scores is from genetic factors. For some studies, the heritability of IQ is much higher, in the .70s (Bouchard, 1994; Bouchard, Lykken, McGue, Segal, & Tellegen, 1990; Pedersen, Plomin, Nesselroade, & McClearn, 1992).

Yet, the heritability of IQ defies any simple summary. For one thing, genetic influence on IQ appears to demonstrate an interaction effect with socioeconomic status (SES). Turkheimer et al. (2003) studied IQ results for 7-year-old twins, many living at or below the poverty level, others reared in middle-class or higher families. The proportion of variance in IQ accounted for by genetic factors was inferred from the similarities/differences in IQ scores of identical versus fraternal twins. For families with the lowest levels of SES, environmental factors accounted for almost all of the variation in IQ. But in families with the highest levels of SES (middle and upper class), genetic factors accounted for almost

no consensus has emerged, and one is not likely to be found soon. The dispute reviewed here

is typical of ethical arguments—the resolution depends in part on irreconcilable values. Furthermore, even among those who agree on values there will be disagreements about the validity of certain relevant scientific theories that are not yet adequately tested. Thus, we feel that there is no way that this dispute can be objectively resolved. Each person must choose as he sees fit (and in fact we are divided). (Hunter & Schmidt, 1976)

When ethical stances clash—as they most certainly do in the application of psychological tests to selection decisions—the court system may become the final arbiter, as discussed later in this book.

GENETIC AND ENVIRONMENTAL DETERMINANTS OF INTELLIGENCE

Genetic Contributions to Intelligence

The nature–nurture debate regarding intelligence is a well-known and overworked controversy that we will largely sidestep here. We concur with McGue, Bouchard, Iacono, and Lykken (1993) that a substantial genetic component to intelligence has been proved by decades of adoption studies, familial research, and twin projects, even though individual studies may be faulted for particular reasons:

When taken in aggregate, twin, family, and adoption studies of IQ provide a demonstration of the existence of genetic influences on IQ as good as can be achieved in the behavioral sciences with nonexperimental methods. Without positing the existence of genetic influences, it simply is not possible to give a credible account for the consistently greater IQ similarity among monozygotic (MZ) twins than among like-sex dizygotic (DZ) twins, the significant IQ correlations among biological relatives even when they are reared apart, and the strong association between the magnitude of the familial IQ correlation and the degree of genetic relatedness. (p. 60)


Perhaps molecular geneticists need those numbers to guide their search for the underlying genes? Perhaps clinical psychologists need those numbers to guide their selection of therapies that work? Or perhaps educators need those numbers to guide their choice of teaching interventions that will be successful? We have seen no indication of the usefulness of the heritability numbers for any of those purposes. Indeed, it has been widely recognized that malleability is not the opposite of heritability. (Kamin & Goldberger, 2001, p. 28)

In sum, traits with high heritability might still prove to be malleable in the face of environmental factors. If this is so, what constructive purpose is served by the flood of heritability estimates found in the research literature?

Thus, we must avoid the tendency to view any corpus of research in a simplistic either/or frame of mind. Even the most diehard hereditarians acknowledge that a person's intelligence is shaped also by the quality of experience. The crucial question is: To what extent can enriched or deprived environments modify intelligence upward or downward from the genetically circumscribed potential? The reader is reminded that the genetic contribution to intelligence is indirect, most likely via the gene-coded physical structures of the brain and nervous system. Nonetheless, the brain is quite malleable in the face of environmental manipulations, which can even alter its weight and the richness of neuronal networks (Greenough, Black, & Wallace, 1987). How much can such environmental impacts sway intelligence as measured by IQ tests? We will review several studies indicating that environmental extremes help determine intellectual outcome within a range of approximately 20 IQ points, perhaps more.

Environmental Effects: Impoverishment and Enrichment

First, we examine the effects of environmental disadvantage. Vernon (1979, chap. 9) has reviewed the early studies of severe deprivation, noting that children reared under conditions in which they received little or no human contact can show striking

all of the variation in IQ. These striking results have been only partially confirmed by other twin studies. The interaction effect is minimal in studies conducted in other countries (Nisbett et al., 2012).

If genuine, as appears to be the case in the United States, the interaction between SES and heritability, with IQ revealing little genetic influence for low-SES children, carries important policy implications:

One interpretation of the finding that heritability of IQ is very low for lower SES individuals is that children in poverty do not get to develop their full genetic potential. If true, there is room for interventions with that group to have large effects on IQ (Nisbett et al., 2012, p. 134).

We investigate the impact of enriched environments such as early educational intervention in a later topic.

A most fascinating demonstration of the genetic contribution to IQ is found in the Minnesota Study of Twins Reared Apart (Segal, 2012). In this ongoing study, identical twins reared apart are reunited for extensive psychometric testing. Bouchard (1994) reports that the IQs of identical twins reared apart correlate almost as highly as those of identical twins reared together, even though the twins reared apart often were exposed to different environmental conditions (in some cases, sharply contrasting environments). In sum, differences in environment appeared to cause very little divergence in the IQs of identical twin pairs reared apart. These findings strongly suggest a genetic contribution to intelligence, with heritability estimated in the vicinity of .70.

The Minnesota Study and other twin studies have been criticized on methodological and philosophical grounds. Methodologically, one concern is that identical twins separated early in life for adoption might be placed in highly similar environments, which would inflate the estimated genetic influence when the twins are reunited and tested in adulthood. Philosophically, some skeptics question the utility and purpose of churning out one heritability estimate after another:

It is not apparent what scientific purposes are served by the sustained flow of heritability numbers for psychological characteristics.


primarily White, from economically advantaged communities, and reared by a married mother with college education. As the authors note, “the sampling design provided for a comparison of populations with starkly contrasting social conditions.” (p. 712)

The mean IQ scores for all samples at both times of testing (age 6 and age 11) are depicted in Figure 6.11. The reader will observe that suburban samples scored higher than inner city samples, and that normal birth weight children scored higher than low-birth-weight children. These results are not especially remarkable—the negative impacts of low birth weight and economic disadvantage are well documented in the literature on group differences in IQ outcomes (e.g., Breslau, 1994; Ceci, 1996). What is noteworthy about the results—one might even say astonishing—is that both of the inner city samples

improvements in IQ—as much as 30 to 50 points—when transferred to a more normal environment. Yet, we must regard this body of research with some skepticism, owing to the typically exceptional conditions under which the initial tests were administered. Can a meaningful test be administered to 7-year-old children raised almost like animals (Koluchova, 1972)?

Typical of this early research is the follow-up study by Skeels (1966) of 25 orphaned children originally diagnosed as having mental retardation (Skeels & Dye, 1939). These children were first tested at approximately 1½ years of age when living in a highly unstimulating orphanage. Thirteen of them were then transferred to another home where they received a great deal of supervised, doting attention from older girls with mental retardation. These children showed a considerable increase in IQ, whereas the 12 who remained behind decreased further in IQ. When traced at follow-up 26 years later, the 13 transferred cases were normal, self-supporting adults, or were married. The other subjects—the contrast group—were still institutionalized or in menial jobs. The enriched group showed an average increase of 32 IQ points when retested with the Stanford-Binet, whereas the contrast group fell below their original scores. Even though we are disinclined to place much credence in the original IQ scores and might, therefore, quarrel with the exact magnitude of the change, the Skeels (1966) study surely indicates that the difference between a severely depriving early environment and a more normal one might account for perhaps 15 to 20 IQ points.

More recently, Breslau, Chilcoat, Susser, and others (2001) conducted a rigorous longitudinal study that illustrates the detrimental impact of growing up in a racially segregated and economically disadvantaged community. Using the WISC-R, they collected longitudinal IQ scores at age 6 and age 11 for large samples of urban and suburban children, some low birth weight (≤2500 grams) and some normal birth weight (>2500 grams). The urban samples were primarily Black, from inner city Detroit, and reared by a single mother with high school (or less) education. These children typically experienced economic deprivation, inferior education, family stress, and racial segregation. The suburban samples were

FIGURE 6.11 Average IQ Scores for Urban and Suburban Children at Age 6 and Age 11. Groups: S-N (suburban, normal birth weight); S-L (suburban, low birth weight); U-N (urban, normal birth weight); U-L (urban, low birth weight). (Vertical axis: Average IQ, 80 to 120; horizontal axis: Age 6 and Age 11.)

Source: Based on data in Breslau, N., Chilcoat, H., Susser, E., and others (2001). Stability and change in children's Intelligence Quotient scores: A comparison of two socioeconomically disparate communities. American Journal of Epidemiology, 154, 711–717.


more than the 5- to 10-point IQ decrement reported by Jensen (1977).

Scarr and Weinberg (1976, 1983) reversed the question probed by Jensen (1977); namely, they asked: What happens to their intelligence when African American children are adopted into the relatively enriched environment provided by economically and educationally advantaged White families? As discussed later, it is well known that African American children reared by their own families obtain IQ scores that average about 15 points below Whites (Jensen, 1980). Some portion of this difference—perhaps all of it—is likely due to the many social, economic, and cultural differences between the two groups. We put that issue aside for now. Instead, we pursue a related question that bears on the malleability of IQ: What difference does it make when African American children are adopted into a more economically and educationally advantaged environment?

Scarr and Weinberg (1976, 1983) found that 130 African American and interracial children adopted into upper-middle-class White families averaged a Full Scale IQ of 106 on the Stanford-Binet or the WISC, a full 6 points higher than the national average and some 18 to 21 points higher than typically found with African American examinees. African American children adopted early in life, before 1 year of age, fared even better, with a mean IQ of 110. We can only wonder what the IQ scores would have been if the adoptions had taken place at birth and if excellent prenatal care had been provided. This study indicates that when the early environment is optimal, IQ can be boosted by perhaps 20 points.

Limitations of space prevent us from further detailed discussion of environmental effects on IQ. It is worth noting, though, that a huge literature has emerged from early intervention and enrichment-stimulation studies of children at risk for school failure and mental retardation (e.g., Barnett & Camilli, 2002; Ramey & Ramey, 1998). In general, these studies show that intervention and enrichment can boost IQ in children at risk for school failure and mental retardation. Summarizing four decades of research, Ramey and Ramey (1998) extracted six principles from the research on early intervention for at-risk children. They refer to these as "remarkable

(low birth weight and normal birth weight) apparently lost an average of 5 IQ points during the five years between initial testing at age 6 and follow-up testing at age 11. In contrast, the suburban samples held constant in IQ during the same time period. It is difficult to conceive of a benign explanation for these findings. Apparently, growing up in the poverty, segregation, and turmoil of the inner city imposes hardships that lead to a decline in IQ scores from age 6 to age 11. The authors summarize the significance of their study as follows:

On average, the IQs of urban children declined by more than 5 points. A change of 5 points in an individual child might be judged by some as clinically nonsignificant. Nevertheless, a change of this size in a population's mean IQ, which reflects a downward shift in the distribution (rather than a change in the shape of the distribution), means that the proportion of children scoring 1 standard deviation or more below the standardized IQ mean of 100 would increase substantially. In this study, the change from age 6 to age 11 years increased the percentage of urban children scoring less than 85 on the WISC-R from 22.2 to 33.2. (Breslau et al., 2001, p. 716)
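
The distributional logic of this passage is easy to verify. Under a normal model with a standard deviation of 15, a 5-point downward shift in the mean reproduces the reported increase almost exactly (a back-of-the-envelope check on our part, not the authors' own computation):

from scipy.stats import norm

sd = 15.0
# Mean implied by 22.2 percent of urban children scoring below 85 at age 6
mean_age6 = 85 - norm.ppf(0.222) * sd        # roughly 96.5
mean_age11 = mean_age6 - 5                   # the reported 5-point decline

# Proportion expected below 85 after the downward shift
print(norm.cdf(85, loc=mean_age11, scale=sd))   # about 0.33, matching 33.2%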

Sadly, the apparent drop of 5 points in average IQ from age 6 to age 11 found in this study may represent only part of the overall impact of environmental deprivation. The full effect over a lifetime could be substantially greater.

Jensen (1977) found similar results in a methodologically novel study of severely impoverished African American children in rural Georgia. Comparing older and younger siblings on the California Test of Mental Maturity (CTMM), he found that children from this setting, which was "as severely disadvantaged, educationally and economically, as can be found anywhere in the United States," appeared to lose up to one IQ point a year, on average, between the ages of 6 and 16. The cumulative loss totaled 5 to 10 IQ points. Furthermore, if we factor in the probable IQ deficit that occurred between birth and age 5, we can surmise that the overall effect of a depriving environment is probably


expanded to children from birth to 5 years of age. In 2012, funding for Head Start was approximately $8 billion. These funds provided a broad range of services including preschool education centers for low-income families, child care homes, medical and dental services, and home-based consultation by developmental experts. Over one million infants and children receive Head Start services each year. Low-income pregnant women also are eligible for services. Interventions are designed to be culturally sensitive and involve the parents as much as possible. School readiness is the overriding goal, which is facilitated through the support of cognitive, language, physical, social, and emotional development.

Zhai, Brooks-Gunn, and Waldfogel (2011) recently completed a study of school readiness in 2,803 Head Start children from 18 cities. When compared with children from any other child care arrangement, children in Head Start demonstrated, at age 5, gains in cognitive development as measured by the Peabody Picture Vocabulary Test-III and a letter-word identification task, improvements in social competence as measured by a subscale from the Adaptive Social Behavior Inventory (Hogan, Scott, & Bauer, 1992), and reductions in their attention problems as measured by a subscale from the Child Behavior Checklist (CBCL; Achenbach & Rescorla, 2000). There were no statistically significant effects on internalizing or externalizing behavior problems on the CBCL. The researchers emphasize that Head Start impacts more than cognitive development. It also enhances attentional and emotional skills essential for school readiness.

Teratogenic Effects on Intelligence and Development

In normal prenatal development, the fetus is protected from the external environment by the placenta, a vascular organ in the uterus through which the fetus is nourished. However, some substances known as teratogens cross the placental barrier and cause physical deformities in the fetus. Especially if the deformities involve the brain, teratogens may produce lifelong behavioral disorders, including low IQ and mental retardation. The list of

consistencies in the major findings” on intervention studies:

1. Interventions that begin earlier (e.g., during infancy) and continue longer provide the best benefits to participating children.

2. More-intensive interventions (e.g., number of visits per week) produce larger positive effects than less-intensive interventions.

3. Direct enrichment experiences (e.g., working directly with the children) provide greater impact than indirect experiences.

4. Programs with comprehensive services (e.g., multiple enhancements) produce greater posi- tive changes than those with a narrow focus.

5. Some children (e.g., those with normal birth weight) show greater benefits from participation than other children.

6. Initial positive benefits diminish over time if the child’s environment does not encourage positive attitudes and continued learning.

One concern about early intervention programs is their cost, which has been excessive for some of the demonstration projects. Skeptics wonder about the practicality and also the ultimate payoff of providing extensive, broad-based, continuing intervention virtually from birth onward for the millions of children at risk for developmental problems. This is a realistic concern because "relatively few early intervention programs have received long-term follow-up" (Ramey & Ramey, 1998). Critics also wonder if the programs merely teach children how to take tests without affecting their underlying intelligence very much (Jensen, 1981). Finally, there is the issue of cultural congruence. Intervention programs are mainly designed by White psychologists and then applied disproportionately to minority children. This is a concern because programs need to be culturally relevant and welcomed by the consumers; otherwise, the interventions are doomed to failure.

One popular intervention program is Head Start, created in 1965 and funded continuously by the federal government. The original program provided comprehensive services for children 3 to 5 years of age. In 1995, with the inception of Early Head Start under President Clinton, coverage was


best advice to pregnant women is to refrain entirely from alcohol. A child with FASD might function in the borderline range of intelligence and manifest poor coordination, difficulty with concept formation, hyperactivity, and problems with executive functions. In the absence of intervention, the consequences to the child, the family, and society are profound, as confirmed by Streissguth, Bookstein, Barr, and others (2004). They studied 415 children and adults with confirmed FASD, searching patient records and interviewing knowledgeable informants. The median IQ of the group was 86, with a range of 29 to 126. Most were young (median age of 14, range 6 to 51), but many had reached adolescence and adulthood. For these older individuals, 60 percent had experienced trouble with the law, 50 percent had been in a jail, prison, or inpatient setting, 49 percent had engaged in inappropriate sexual behaviors, and 35 percent experienced alcohol or drug problems. In spite of these markers of turmoil and social disruption, early diagnosis of FASD and placement in a stable environment dramatically reduced the likelihood of these adverse outcomes.

FASD likely is more common than previously thought. According to the Centers for Disease Control and Prevention (CDC, 2012), 7.6 percent of pregnant women report using alcohol, including 1.4 percent who engage in binge drinking (6 or more drinks per occasion). These data probably underestimate alcohol intake during pregnancy, because some women will be reluctant to report honestly on their drinking. Clearly, a small proportion of pregnant women continue to drink, in spite of widespread public health warnings. As a result, FASD persists as a public health problem.

Many affected children do not show the characteristic facial anomalies and therefore never receive proper diagnosis and early intervention. In a thorough study of elementary school children in two counties in Washington State, Clarren, Randels, Sanderson, and Fineman (2001) found that only 1 in 7 children with FAS had been previously diagnosed. Based on epidemiological findings and the convergence of evidence from several research methods, May, Gossage, Kalberg, and others (2009) concluded that the current prevalence of FASD among younger school children may be as high as 2 to 5 percent

potential teratogens is almost endless and includes prescription drugs, hormones, illicit drugs, smoking, alcohol, radiation, toxic chemicals, and viral infections (Berk, 1989; Martin, 1994). We will briefly highlight the most prevalent and also the most preventable teratogen of all, alcohol.

Heavy drinking by pregnant women causes their offspring to be at very high risk for fetal alcohol syndrome (FAS), a specific cluster of abnormalities first described by Jones, Smith, Ulleland, and Streissguth (1973). Intelligence is markedly lower in children with FAS. When assessed in adolescence or adulthood, about half of all persons with this disorder score in the range of mental retardation on IQ tests (Olson, 1994). Prenatal exposure to alcohol is one of the leading known causes of mental retardation in the Western world. The defining criteria of FAS include the following:

1. Prenatal and/or postnatal growth retardation—weight below the tenth percentile after correcting for gestational age

2. Central nervous system dysfunction—skull or brain malformations, mild to moderate mental retardation, neurological abnormalities, and behavior problems

3. Facial dysmorphology—widely spaced eyes, short eyelid openings, small up-turned nose, thin upper lip, and minor ear deformities (Sokol & Clarren, 1989)

The full-blown FAS syndrome occurs mainly in offspring of women alcoholics—those who ingest many drinks per occasion.

Children exposed to lesser levels of alcohol during pregnancy may manifest a range of consequences known collectively as Fetal Alcohol Spectrum Disorder (FASD) (Bertrand, Floyd, Weber, and others, 2004). FASD is an unofficial umbrella term that encompasses the entire range of adverse consequences. These outcomes include full-blown FAS, the most devastating result of prenatal exposure to alcohol, and other manifestations referred to with terms such as fetal alcohol effect, alcohol-related neurodevelopmental disorder, and similar designations. Even though the existence of adverse effects from prenatal exposure to low or moderate drinking is still disputed (Abel, 2009), the


that “asymptomatic” lead exposure was associated with decrements in overall intelligence (about 4 IQ points) and lowered performance on verbal subtests, auditory and speech processing tests, and a reaction time measure of attention. These differences persisted at follow-up 11 years later (Needleman, Schell, Bellinger, Leviton, & Allred, 1990). Yet, using a similar study method, Smith, Delves, Lansdown, Clayton, and Graham (1983) found a nonsignificant effect from children’s lead exposure when social factors such as the parents’ level of education and social status were controlled.

In part, research findings on this topic are contradictory because it is difficult to disentangle the effects of lead from those of poverty, stress, poor nutrition, and other confounding variables (Kaufmann, 2001a, b). Most likely, asymptomatic lead exposure has harmful effects on the nervous system that translate to reduced intelligence, impaired attention, and a host of other undesirable behavioral consequences.

Recent studies continue to raise alarm about the impact of very low levels of lead exposure on the behavioral and neurocognitive functioning of children. Marcus et al. (2010) completed a meta-analysis of 19 studies on lead (from hair samples) and behavior problems in 8,561 children. The average correlation across all studies was r = .19 (p < .001), that is, the higher the lead level, the greater the severity of conduct problems. Strayhorn and Strayhorn (2012) studied achievement scores in relation to elevated blood lead levels in children for the 57 counties of New York State, using family income as a covariate. Achievement scores were taken from state-wide English and mathematics testing conducted in the third and eighth grades. The partial correlations between incidence of elevated lead and number of children in the lowest achievement levels ranged from .29 to .40 (p < .05). The researchers found a direct linear relationship: for each one percent increase in children with lead levels elevated beyond the official CDC limit, there was a corresponding one percent increase in children in the lowest achievement group.
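A partial correlation of the kind Strayhorn and Strayhorn report removes the influence of the covariate (family income) from the lead-achievement association. A minimal sketch of the first-order partial correlation formula follows, using invented correlation values rather than the study's actual data:

import math

def partial_corr(r_xy, r_xz, r_yz):
    # First-order partial correlation between x and y, controlling for z.
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical values for illustration only:
# x = county rate of elevated blood lead
# y = proportion of children in the lowest achievement level
# z = family income (the covariate)
r_xy, r_xz, r_yz = 0.45, -0.30, -0.40   # assumed, not from the study
print(round(partial_corr(r_xy, r_xz, r_yz), 2))   # -> 0.38

The formula simply asks how strongly lead and achievement remain correlated once the portion of each that is predictable from income has been removed.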

These recent studies probably help explain why the CDC recently lowered the level of acceptable blood lead burden from 10 to 5 μg/dL, the first change in 20 years (New York Times, May 17,

in the United States and some western European countries. The social, health, and economic consequences of these estimated prevalence rates are cause for concern.

Effects of Environmental Toxins on Intelligence

Many industrial chemicals and by-products may impair the nervous system temporarily, or even cause permanent damage that affects intelligence. Examples include lead, mercury, manganese, arsenic, thallium, tetra-ethyl lead, organic mercury compounds, methyl bromide, and carbon disulphide (Lishman, 1997). Long-term exposure to organophosphate pesticides, such as that encountered by some farm workers, is known to cause neurobehavioral deficits in memory, fine motor control, response speed, and mental flexibility (Mackenzie Ross, Brewin, & Curran, 2010; Roldán-Tapia, Parrón, & Sánchez-Santed, 2005). Certainly, the most widely studied of these environmental toxins is lead, which we examine in modest detail here.

Sources of human lead absorption include eating lead-pigmented paint chips (common among infants and toddlers), breathing particulate lead from smelter emissions, eating food from lead-soldered cans or lead-glazed pottery, and drinking water that has passed through lead pipes. Because the human body excretes lead slowly, most citizens of the industrialized world carry a lead burden substantially higher—perhaps 500 times higher—than was typical in pre-Roman times (Patterson, 1980).

The hazards of high-level lead exposure are acknowledged by every medical and psychological researcher who has investigated this topic. High doses of lead are irrefutably linked to cerebral palsy, seizure disorders, blindness, mental retardation, even death. The more important question pertains to “asymptomatic” lead exposure: Can a level of absorption that is insufficient to cause obvious medical symptoms nonetheless produce a decrement in intellectual abilities?

Research findings on this topic are complex and controversial. Using tooth lead from shed teeth of young children as their index of cumulative lead burden, Needleman and associates (1979) reported



by 14½ points (Reynolds, Chastain, Kaufman, & McLean, 1987). In the standardization sample for the fourth edition of the Stanford-Binet (Thorndike, Hagen, & Sattler, 1986), a difference of about 17½ points (mean of 103.5 versus 86.1) was observed. For these early studies, when demographic variables such as socioeconomic status are taken into account, the size of the mean difference reduces to .5 to .7 standard deviations (7 to 10 IQ points) but does not disappear (Reynolds & Brown, 1984a). Put simply, the existence of race differences in IQ has been reported with such consistency that it is no longer the focus of serious dispute.

However, the interpretation of race differences in IQ is an issue of fierce ongoing debate. Why the disparity exists, what it means from a practical standpoint, and whether the gap is narrowing—all these topics engender a full range of opinions (Fagan & Holland, 2007; Rushton & Jensen, 2005). We begin our discussion with the question of origins—what are the causes of the Black-White IQ difference?

One viewpoint (discussed previously) is that the observed IQ disparity is caused, partly or wholly, by test bias. Although popular and widely held, this viewpoint is rarely supported by technical studies of test bias. Test bias may play a small role in race differences, but it cannot explain the persistent difference in IQ scores between Black and White Americans. Here we intend to examine a different hypothesis, namely: Is the IQ difference between Black and White Americans due, in significant part, to genetic sources?

The Genetic Hypothesis for Race Differences in IQ

The hypothesis of a genetic basis for race differences in IQ first gained scholarly prominence in 1969 when Arthur Jensen published a provocative paper titled “How Much Can We Boost IQ and Scholastic Achievement?” (Jensen, 1969). Jensen set the tone for his paper in the opening sentence when he asserted that “compensatory education has been tried and it apparently has failed.” He further contended that compensatory education programs were based on two fallacious theoretical underpinnings, namely, the “average child concept,” which views children as

2012, CDC lowers recommended lead-level limits in children). The current level, 5 μg/dL, is an exceedingly small level of exposure. One μg (microgram) is one-millionth of a gram, and a dL (deciliter) is one-tenth of a liter, or almost half a cup.

In addition to the health burden from lead exposure, the overall national costs are substantial, as outlined in a recent social policy report from the Society for Research in Child Development (SRCD):

Children’s exposure to lead is expensive, incurring costs associated with health care and losses associated with lowered intellectual development, earnings, and tax contributions. One study put the overall cost of exposure in children 6 and under at $192 to $270 billion over six years. Another cost analysis concluded that reducing children’s blood lead levels just 1 μg/dL would save $7.56 billion annually (SRCD, 2010, p. 2).

Prudence dictates that we should reduce lead exposure in humans to the lowest levels possible.

ORIGINS AND TRENDS IN RACIAL IQ DIFFERENCES

Early Studies of African American and White IQ Differences

Racial differences in IQ have been recorded since the beginnings of standardized testing. The most widely studied disparity is between African American and White samples, where a discrepancy favoring Whites of about one standard deviation (15 points) is historically reported. We should add that the term Black is used interchangeably with African American, and that White refers to non-Hispanic White individuals. The IQ difference fluctuates from one analysis to the next—as small as 10 points in a few studies but as large as 20 points in others. For example, in the 1960 restandardization of the Stanford-Binet, the White sample (M = 101.8) outscored the Black sample (M = 80.7) by slightly more than 20 IQ points (Kennedy, Van de Riet, & White, 1963). A lesser difference was revealed on the 1981 WAIS-R where Whites (M = 101.4) outscored Blacks (M = 86.9)


leaving, unemployment, illegitimacy, crime, and a host of other social pathologies. But two chapters on ethnic differences in intelligence caused an uproar among social scientists and the lay public. The authors reviewed dozens of studies and concluded that the IQ gap between African Americans and Whites has changed little in this century. They also argued that test bias cannot explain the race differences. Furthermore, they noted that races differ not just in average IQ scores but also in the profile of intellectual abilities. In addition, they concluded that intelligence is only slightly malleable even in the face of intensive environmental intervention. As did Jensen, Herrnstein and Murray (1994) stated their genetic hypothesis with considerable circumspection:

It seems highly likely to us that both genes and the environment have something to do with racial differences [in cognitive ability]. What might the mix be? We are resolutely agnostic on that issue; as far as we can determine, the evidence does not yet justify an estimate.

Although the authors declined to provide an estimate of the genetic contribution to race differences in IQ, it is clear from the tone of their pessimistic book that they believe it to be substantial. Recently, Arthur Jensen has reentered the debate on the origins of IQ differences between African Americans and White Americans and reaffirmed his earlier judgment that the disparity is “partly heritable” (Rushton & Jensen, 2005). Is this conclusion warranted by the evidence?

Tenability of the Genetic Hypothesis

The genetic hypothesis for race IQ differences is an unpopular idea that is anathema to many laypersons and social scientists. But contempt for an idea does not constitute disproof, and superficiality is no substitute for a reasoned examination of evidence. In light of additional analysis and research, is the genetic hypothesis for IQ differences tenable? We will examine three lines of evidence here that indicate that the answer is “no.”

Several critics have pointed out that the genetic hypothesis is based on the questionable assumption

more or less homogeneous, and the “social deprivation hypothesis,” which asserts that environmental deprivation is the primary cause of lowered achievement and IQ scores. Jensen argued forcefully against both suppositions. Furthermore, leaning heavily on the literature in behavior genetics, Jensen implied that the reason Whites scored higher than African Americans on IQ tests was probably related more to genetic factors than to the effects of environmental deprivation. The thrust of his paper was to suggest that, since compensatory education has proved ineffectual, and since the evidence suggests a strong genetic component to IQ, it is appropriate to entertain a genetic explanation for the well-documented difference in favor of Whites on IQ tests. He formulated the genetic hypothesis in a careful, tentative, scholarly manner:

The fact that a reasonable hypothesis has not been rigorously proved does not mean that it should be summarily dismissed. It only means that we need more appropriate research for putting it to the test. I believe such definitive research is entirely possible but has not been done. So all we are left with are various lines of evidence, no one of which is definitive alone, but which, viewed all together, make it a not unreasonable hypothesis that genetic factors are strongly implicated in the average Negro-white intelligence difference. The preponderance of the evidence is, in my opinion, less consistent with a strictly environmental hypothesis than with a genetic hypothesis, which, of course, does not exclude the influence of environment or its interaction with genetic factors. (Jensen, 1969)

With the articulation of a genetic hypothesis for race differences in IQ, Jensen provoked an intense debate that has raged on, with periodic lulls, to the present day.

In the mid-1990s the controversy over a genetic basis for race differences in IQ was intensified once again with the publication of The Bell Curve by Richard Herrnstein and Charles Murray (1994). This massive tome was primarily a book about the importance of IQ as a predictor of poverty, school



IQ differences were almost completely eliminated. Their study suggests that previous research has underestimated the pervasive effects of poverty and its cofactors as a contribution to African American and White IQ differences.

A third criticism of the genetic hypothesis is that race as a biological entity is simply nonexistent, that is, there are no biological races. Fish (2002) and other proponents of this viewpoint argue that “race” is a socially constructed concept, not a biological reality:

Homo sapiens has no extant subspecies: There are no biological races. Human physical appearance varies gradually around the planet, with the most geographically distant peoples generally appearing the most different from one another. The concept of human biological races is a construction socially and historically localized to 17th and 18th-century European thought. Over time, different cultures have developed different sets (folk taxonomies) of socially defined “races.” (p. 29)

Put another way, racial categories are social constructions based on superficial physical differences (especially skin color) that serve cultural-psychological objectives (e.g., reducing uncertainty about how we should respond to one another). However, racial categories do not signify meaningful biological differences. A biologist expresses the point this way: “All of humanity shares in common the vast majority of its molecular genetic variation and the adaptive traits that define us as a single species” (Templeton, 2002, p. 51). Thus, insofar as race has no biological reality, the argument that “race” differences in IQ originate from a genetic basis is not only pernicious, it is also absurd. Neisser, Boodoo, Bouchard, and others (1996) offer additional perspectives on race differences in IQ and related topics.

Before leaving the topic of race differences in IQ, we should point out that the emotion attached to this topic is largely undeserved, for two reasons. First, racial groups always show large overlaps in IQ—meaning that the peoples of the earth are much more alike than they are different. Second, as previously noted, the existing race differences in IQ certainly reflect cultural differences and environmental factors

that evidence of IQ heritability within groups can be used to infer heritability between racial groups. Jensen (1969) expressed this premise rather explicitly, pointing to the substantial genetic component in IQ as suggestive evidence that differences in IQ between African Americans and White Americans are, in part, genetically based. Echoing earlier critics, Kaufman (1990) responds as follows:

One cannot infer heritability between groups from studies that have provided evidence of the IQ’s heritability within groups. Even if IQ is equally heritable within the black and white races separately, that does not prove that the IQ differences between the races are genetic in origin. Scarr-Salapatek’s (1971, p. 1226) simple example explains this point well: Plant two randomly drawn samples of seeds from a genetically heterogeneous population in two types of soil—good conditions versus poor conditions—and compare the heights of the fully grown plants. Within each type of soil, individual variations in the heights are genetically determined; but the average difference in height between the two samples is solely a function of environment.
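Scarr-Salapatek's point is easy to verify with a small simulation. In the sketch below (arbitrary parameters, not real data), two groups receive identical genetic distributions, but one grows up in a depriving environment that subtracts a constant from every score. Heritability within each group is high and equal, yet the entire between-group gap is environmental:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Identical genetic endowments in both groups.
genes_a = rng.normal(100, 10, n)
genes_b = rng.normal(100, 10, n)

# Group B develops in a depriving environment: a uniform 15-point penalty.
pheno_a = genes_a + rng.normal(0, 5, n)
pheno_b = genes_b + rng.normal(0, 5, n) - 15

# Within-group heritability (proportion of phenotypic variance
# explained by genes) is high and equal in both groups ...
print(np.corrcoef(genes_a, pheno_a)[0, 1] ** 2)   # ~0.80
print(np.corrcoef(genes_b, pheno_b)[0, 1] ** 2)   # ~0.80

# ... yet the 15-point mean difference is entirely a function of environment.
print(pheno_a.mean() - pheno_b.mean())            # ~15 points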

Another criticism of the genetic hypothesis is that careful analysis of environmental factors provides a sufficient explanation of race differences in IQ, that is, the genetic hypothesis is simply unnecessary. This is the approach taken by Brooks-Gunn, Klebanov, and Duncan (1996) in a study of 483 African American and White low birth weight children. What makes their study different from other similar analyses is the richness of their data. Instead of using only one or two measures of the environment (e.g., a single index of poverty level), they collected longitudinal data on income level and many other cofactors of poverty such as length of hospital stay, maternal verbal ability, home learning environment, neighborhood condition, and other components of family social class. When the children’s IQs were tested at age 5 with the WPPSI, the researchers found the usual disparity between the White children (mean IQ of 103) and the African American children (mean IQ of 85). However, when poverty and its cofactors were statistically controlled, the


Overall, the average IQ for Black schoolchildren was estimated to be 90.5 in 2002, indicating that Black children have made large IQ gains relative to Whites since the 1960s. Dickens and Flynn (2006) conclude that further Black economic progress would produce additional gains in IQ. This conclusion provides an optimistic outlook on a contentious social issue.

AGE CHANGES IN INTELLIGENCE

We turn now to another controversial topic—whether intelligence declines with age. Certainly, one of the most pervasive stereotypes about aging is that we lose intellectual ability as we grow older. This belief is so widespread that few laypersons question it. But we should question it.

In general, the empirical study of this topic provides a more optimistic conclusion than the common stereotype suggests. However, the research also reveals that age changes in intelligence are complex and multifaceted. The simple question, “Does intelligence decline with age?” turns out to have several labyrinthine answers.

We can trace the evolution of research on age-related intellectual changes as follows:

1. Early cross-sectional research with instruments such as the WAIS painted a somber picture of a slow decline in general intelligence after age 15 or 20 and a precipitously accelerated descent after age 60.

2. Just a few years later, more sophisticated studies using sequential testing with multidimensional instruments such as the Primary Mental Abilities Test suggested a more optimistic trajectory for intelligence: minimal change in most abilities until at least age 60.

3. Parallel research utilizing the fluid/crystallized distinction posited a gradual increase in crystallized intelligence virtually to the end of life, juxtaposed against a rapid decline in fluid intelligence.

4. Most recently, a few psychologists have proposed that adult intelligence is qualitatively different, akin to a new Piagetian stage that might be called postformal reasoning. This research calls into question the ecological validity of using standard instruments with older examinees.

to a substantial degree. Wilson (1994) has catalogued the numerous differences in cultural background between African Americans and White Americans. In 1992, for example, 64 percent of African American parents were divorced, separated, widowed, or never married; 63 percent of African American births were to unmarried mothers; and 30 percent of African American births were to adolescents (U.S. Bureau of the Census, 1993). On average, these realities of family life for many African Americans inevitably will lead to lowered performance on intelligence tests. Lest the reader conclude that we are hereby endorsing a subtle form of Anglocentric superiority, consider Lynn’s (1987) conclusion that the mean IQ of the Japanese is 107, a full 7 points higher than the average for American Whites. So what?

Recent Trends in Race Differences in IQ

An important question is whether Black–White IQ differences have remained stable over recent decades (which could support a genetic basis for the IQ disparity) or whether the gap has narrowed in response to environmental progress (which could indicate a substantial ecological source for the IQ disparity). The former conclusion (stability of the IQ difference) has been stated by Jensen and others who hypothesize, in part, a genetic basis for the discrepancy (Jensen, 1980; Jensen & Rushton, 2005).

In contrast, a recent analysis by Dickens and Flynn (2006) supports a significant narrowing of the racial IQ gap. These researchers considered comparative longitudinal data for Black and White examinees for the period 1970 to 2000 with successive editions of four carefully standardized instruments: the Stanford-Binet, the Wechsler Intelligence Scale for Children, the Wechsler Adult Intelligence Scale, and the Armed Forces Qualifying Test. Their findings are complex and statistically laden, but here is the big picture: on all four instruments, Blacks gained in IQ compared to Whites during 1970 to 2000, the average gain amounting to 4 to 7 IQ points. The authors conclude:

The constancy of the Black-White IQ gap is a myth and therefore cannot be cited as evidence that the racial IQ gap is genetic in origin. (p. 917)



Overlooked by Wechsler and many other cross-sectional design researchers was the influence of their methodology on their findings. It has been recognized for quite some time that cross-sectional studies often confound age effects with educational disparities or other age-group differences (see Baltes, Reese, & Nesselroade, 1977; Kausler, 1991). For example, in the normative studies of the Wechsler tests, it is invariably true that the younger standardization subjects are better educated than the older ones. In all likelihood, the lower scores of the older subjects are caused, in part, by these educational differences rather than signifying an inexorable age-related decline.

Sequential Studies of Intelligence

To control for age-group differences, many researchers prefer a longitudinal design in which the same subjects are retested one or more times over periods of 5 to 10 years and, in rare cases, up

We examine each of these research epochs in more detail in the following sections.

Early Cross-Sectional Research

One of the earliest comprehensive studies of age trends on an individually administered intelligence test was reported by Wechsler (1944) shortly after publication of the Wechsler-Bellevue Form I. As is true of all the Wechsler tests designed for adults, raw scores on the W-B I subtests were first transformed into standard scores (referred to as scaled scores) with a mean of 10 and standard deviation of 3. Regardless of the age of the subject, these scaled scores were based on a fixed reference group of 350 subjects ages 20 to 34 included in the standardization sample. By consulting the appropriate age table, the sum of the 11 scaled scores was then used to find an examinee’s IQ.
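The scaled-score transformation is a simple linear rescaling of the raw score against the fixed reference group. A minimal sketch follows (the reference mean and SD below are invented for illustration; the actual values come from the W-B I normative tables):

def scaled_score(raw, ref_mean, ref_sd):
    # Standard score with mean 10 and SD 3, anchored to the
    # fixed 20- to 34-year-old reference group.
    z = (raw - ref_mean) / ref_sd
    return round(10 + 3 * z)

# Hypothetical reference statistics for a single subtest:
print(scaled_score(raw=32, ref_mean=25, ref_sd=7))   # -> 13
# The sum of all 11 scaled scores would then be converted
# to an IQ through the age-appropriate table.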

However, the sum of the scaled scores by itself is a direct index of an examinee’s ability relative to the reference group. Wechsler used this index to chart the relationship between age and intelligence. His results indicated a rapid growth in general intelligence in childhood through age 15 or 20, followed by a slow decline to age 65. He was characteristically blunt in discussing his findings:

If the fact that intellectual growth stops at about the age of fifteen has been a hard fact to accept, the indication that intelligence after attaining its maximum forthwith begins to decline just as any other physiological capacity, instead of maintaining itself at its highest level over a long period of time, has been an even more bitter pill to swallow. It has, in fact, proved so unpalatable that psychologists have generally chosen to avoid noticing it. (Wechsler, 1952)

Normative studies with subsequent Wechsler adult tests revealed exactly the same pattern. For example, results for the WAIS-IV are charted in Figure 6.12, which shows the average uncorrected subtest scores for all age groups in the normative sample, relative to results for the highest scoring age group (25- to 29-year-olds).

FIGURE 6.12 The Curve of Supposed Age-Related Decline in Average WAIS-IV Subtest Scores. The figure plots average subtest scores in relation to ages 25–29 (vertical axis, roughly 6 to 11) for age groups from 16–19 through 85–90. Source: Based on data in Wechsler, D. (2008). Manual for the Wechsler Adult Intelligence Scale—Fourth Edition. San Antonio, TX: Pearson.


Three conclusions emerged from Schaie’s cross-sequential study of adult mental abilities:

1. Each cross-sectional study indicated some degree of apparent age-related decrement in mental abilities, postponed until after age 50 for some abilities, but beginning after age 35 for others. In particular, Number skills and Word Fluency showed an age-related decrement only after age 50, whereas Verbal Meaning, Space, and Reasoning scores appeared to decline sooner, after age 35.

2. Successive cross-sectional studies—the cross-sectional sequence—revealed significant intergenerational differences in favor of those born most recently. Even holding age constant, those born and tested most recently performed better than those born and tested at an earlier time. For example, 30-year-old examinees tested in 1977 tended to score better than 30-year-old examinees tested in 1970, who tended to score better than 30-year-old examinees tested in 1963, who, in turn, outperformed 30-year-old examinees tested in 1956. However, these cohort differences in intelligence were not uniform across the different abilities measured by the PMA Test. The pattern of rising abilities was most apparent for Verbal Meaning, Reasoning, and Space. Cohort changes for Number and Word Fluency were more complex and contradictory.

3. In contrast to the moderately pessimistic findings of the cross-sectional comparisons, the longitudinal comparisons showed a tendency for mean scores either to rise slightly or to remain constant until approximately age 60 or 70. The only exceptions to this trend involved highly speeded tests such as Word Fluency, in which the examinee must name words in a given category as quickly as possible, and Number, in which the examinee must complete arithmetic computations quickly and accurately.

The results of the Schaie study are even more optimistic when individual longitudinal findings are disentangled from the group averages. As previously noted, the longitudinal findings differed from

to 40 years later. Because there is only one group of subjects, longitudinal designs eliminate age-group disparities (e.g., more education in the young than the old subjects) as a confounding factor. However, the longitudinal approach is not without its shortcomings. Longitudinal studies are prone to practice effects, which is the finding that participants learn the answers when they take the same test on several occasions; selective attrition, which is the observation that the least healthy participants are the most likely to drop out; and history, which is the discovery that major historical events (e.g., the Great Depression) can distort the intellectual and psychological development of entire generations.

The most efficient research method for studying age changes in ability is a cross-sequential design that combines cross-sectional and longitudinal methodologies (Schaie, 1977):

In brief, the researchers begin with a cross-sectional study. Then, after a period of years, they retest these subjects, which provides longitudinal data on several cohorts—a longitudinal sequence. At the same time, they test a new group of subjects, forming a second cross-sectional study—and, together with the first cross-sectional study, a cross-sectional sequence. This whole process can be repeated over and over (every five or ten years, say) with retesting of old subjects (adding to the longitudinal data) and first-testing of new subjects (adding to the cross-sectional data). (Schaie & Willis, 1986)
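The logic of the design is easiest to see as a cohort-by-occasion grid. The sketch below builds such a grid for three testing waves (the years match Schaie's first three waves; the grid itself is purely illustrative). Reading across a row follows one cohort over time, the longitudinal sequence; reading down a column compares all cohorts tested in one year, a cross-sectional study, and successive columns form the cross-sectional sequence:

# Hypothetical entry years for three cohorts and three testing waves.
waves = [1956, 1963, 1970]
entry_year = {"Cohort A": 1956, "Cohort B": 1963, "Cohort C": 1970}

for cohort, entered in entry_year.items():
    # Each cohort is tested at its entry wave and at every wave
    # thereafter; these occasions form its longitudinal sequence.
    occasions = [w for w in waves if w >= entered]
    print(cohort, occasions)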

In 1956, Schaie began the most comprehensive cross-sequential study ever conducted in what is referred to as the Seattle Longitudinal Study (Schaie, 1958, 1996, 2005). He administered Thurstone’s test of five primary mental abilities (PMAs) and other intelligence-related measures to an initial cross-sectional sample of 500 community-dwelling adults. The PMA Test subtests include Verbal Meaning, Space, Reasoning, Number, and Word Fluency. In 1963, he retested these subjects and added a new cross-sectional cohort. Additional waves of data were collected in 1970, 1977, 1984, 1991, and 1998.



a maximum possible score of 76. Participants also took the Mini-Mental State Exam (MMSE) when tested in old age. As noted, this measure is a simple 30-item test of orientation, memory, and other cognitive skills. The MMSE is used for dementia screening, and normal adults typically score in the range of 27 to 30 points.

Mindful that the data come from separate cohorts born in 1921 and 1936, the results appeared to indicate a decline, after age 70, in general intelligence as measured by the MHT. Specifically, average scores at ages 70, 79, and 87 were 64.2, 59.2, and 54.1, respectively, indicating a gradual decline in general intelligence after age 70. In contrast, orientation, memory, and everyday cognitive skills declined little, about half a point (on the 30-item MMSE), on average, every decade or so. The scores for both the MHT and the MMSE revealed greater variability with advancing age, a common finding in research on aging.

Gow et al. (2011) also sought to determine whether high intelligence in youth buffers against cognitive decline in old age. This was the special virtue of possessing test scores for all participants at age 11, which allowed researchers to map the trajectories of cognitive capacity as a function of initial ability. In the 1921 cohort tested at ages 79 and 87, they found that higher intelligence at age 11 did not slow the decline experienced in later life. Participants with initially higher MHT scores showed just as much cognitive decline as those with initially lower scores, but still maintained their relative advantage when tested in old age.

Age and the Fluid/Crystallized Distinction

Although we concur with the conclusions of Schaie and Willis (1986), it would be unfair to leave the impression that all authorities in this area agree. Horn and Cattell have been the most vocal skeptics, arguing for a significant age-related decrement in fluid intelligence because of its reliance upon neural integrity, which is presumed to decline with advancing age (Horn & Cattell, 1966; Horn, 1985). Cross-sectional studies certainly support this view. For example, Wang and Kaufman (1993) plotted age differences in vocabulary and matrices scores from

one mental ability to another. Nonetheless, taking the average of the five PMAs and using the 25th percentile for 25-year-olds as his standard of meaningful decline, Schaie has shown that no more than 25 percent of those studied had declined by age 67. From age 67 to age 74 about a third of the subjects had declined, whereas from age 74 to age 81, slightly more than 40 percent had declined (Schaie, 1980, 1996; Schaie & Willis, 1986). In sum, the vast majority of us show no meaningful decline in the skills measured by the Primary Mental Abilities Test until we are well into our seventies. Perhaps even more impressive is the fact that approximately 10 percent of the sample improved significantly when retested in their seventies and eighties. Based on his research and other longitudinal studies, Schaie arrives at this conclusion:

If you keep your health and engage your mind with the problems and activities of the world around you, chances are good that you will experience little if any decline in intellectual performance in your lifetime. That’s the promise of research in the area of adult intelligence. (Schaie & Willis, 1986)

A recent study by Gow, Johnson, Pattie, and others (2011) provides additional insight into the fate of intelligence in old age. They obtained follow-up test data from elderly persons at ages 70, 79, and 87, using the same instrument first administered to participants at age 11. One cohort, born in 1921, was retested at age 79 and again at age 87. A second cohort, born in 1936, was retested at age 70. Sample sizes were very large, in the hundreds at each testing. The same test, the Moray House Test, No. 12 (MHT), was used throughout. The MHT consists of 71 items involving diverse domains of general intelligence, including following directions, same-opposites, word classification, analogies, practical items, and reasoning. Although little recognized in the United States, the MHT is a respected instrument used in Scotland and elsewhere for tracking epidemiological changes in intelligence. MHT total scores correlate about .80 with Stanford-Binet IQ scores (Gow et al., 2011). The test does not provide an IQ. Results are given as a total raw score, with


74.6, and 84.3 years of age, respectively. These individuals were administered a battery of 37 cognitive and neuropsychological measures assembled from well-known instruments, including the Wechsler Adult Intelligence Scale-Revised (WAIS-R, Wechsler, 1981), the Primary Mental Abilities test (PMA, Schaie, 1985), and several other tests. In Figure 6.13, we have depicted the results from four key subtests. Two of these subtests depend heavily on fluid cognitive factors (Reasoning and Spatial Thinking from the PMA), and two require significant crystallized abilities (Vocabulary and Comprehension from the WAIS-R). Scores are depicted as a percentage of the early-old group (ages 60–69), which typically earned the highest average score on all subtests. The reader will notice that raw scores on Comprehension and Vocabulary (crystallized abilities) reveal a nearly flat trend for the three age groups, whereas raw scores on Reasoning and Spatial Thinking (fluid abilities) disclose a steep decline for individuals in their 70s, 80s, and beyond.

GENERATIONAL CHANGES IN IQ SCORES

What happens to the intelligence of a population from one generation to the next? For example, how does the intelligence of Americans in the year 2010 compare to the intelligence of their forebears in the early 1900s? We might expect that any differences would be small. After all, the human gene pool has remained essentially constant for centuries, perhaps millennia. Furthermore, only a small fraction of any generation is exposed to the extremes of environmental deprivation or enrichment that might stunt or boost intelligence dramatically. Common sense dictates that any generational changes in population intelligence would be minimal.

On this issue, common sense appears to be incorrect. Flynn (1984, 1987) charted the comparison data from successive editions of the Stanford-Binet and the Wechsler tests from 1932 to 1981 and found that, with only one exception, each edition established a higher standard than its predecessor. For example, when the WISC-R was released in the 1970s, a large sample of five- and six-year-old children was tested on both this

the Kaufman Brief Intelligence Test and found little change in vocabulary (crystallized measure) but a sharp drop in matrices (fluid measure). These results held true even when the scores were adjusted for educational level. Of course, cross-sectional studies are open to rival interpretations and can, therefore, only suggest longitudinal patterns. Readers who wish to pursue this controversy should consult Hofer, Sliwinski, & Flaherty (2002) and Lindenberger and Baltes (1994).

More recently, Schaie, Caskie, Revell, and others (2005) demonstrated the same age-related patterns (negligible changes in crystallized measures, large decrements in fluid measures) in a follow-up study of older participants from the Seattle Longitudinal Study. Their participants comprised three groups: early-old (ages 60–69, N = 180), middle-old (ages 70–79, N = 205), and old-old (ages 80–95, N = 114). On average, the three groups were 64.2,

FIGURE 6.13 Cross-Sectional Comparison of Age Trends for Four Cognitive Subtests. The figure plots Comprehension, Vocabulary, Reasoning, and Spatial Thinking scores as a percentage of the early-old reference group (vertical axis, 50 to 100) across three age groups: early-old (M = 64.2), middle-old (M = 74.6), and old-old (M = 84.3). Source: Based on data from Schaie, K. W., Caskie, G., Revell, A., & others (2005). Extending neuropsychological assessments in the Primary Mental Ability space. Aging, Neuropsychology, and Cognition, 12, 245–277.



Other explanations for the Flynn effect include better nutrition, improved prenatal care, greater educational access, and increased environmental complexity (Lynn, 2009; Sundet, Borren, & Tambs, 2008). On this last point, environmental complexity, Flynn (2007b) provides a telling illustration by way of generational changes in TV programs. He notes that early 1960s shows like I Love Lucy or Dragnet required almost no concentration to follow, whereas in the 1980s dramas like Hill Street Blues introduced up to 10 threads in the story line. More recently, the hit action-thriller drama 24 portrays as many as 20 characters and multiple plot lines.

In a recent interview, Flynn has suggested that ways of thinking and solving problems have undergone dramatic worldwide shifts in the last century.

Today we take it for granted that using logic on the abstract is an ability we want to cultivate and we are interested in the hypothetical. People from 1900 were not scientifically oriented but utilitarian and they used logic, but to use it on the hypothetical or on abstractions was foreign to them. Alexander Luria [a Soviet psychologist] went to talk to headmen in villages in rural Russia and he said to them: “Where there is always snow, bears are white. At the North Pole there is always snow, what colour are the bears there?” And they said: “I’ve only seen brown bears.” And he said: “What do my words convey?” And they said: “Such a thing as not to be settled by words but by testimony.” They didn’t settle questions of fact by logic, they settled them by experience (Witchalls, 2012, p. 1).

Regardless of the causes, the Flynn effect has sensitized psychologists to the dangers of rendering conclusions based on ever-shifting intelligence test norms. Changes in IQ over time make it imperative to restandardize tests frequently; otherwise, examinees are being scored with obsolete norms and will receive inaccurate IQ scores. This is especially a problem when IQ scores are used for important decisions such as eligibility for learning disability programs or entitlement to social security benefits. At the other extreme, issues literally of life and death

instrument and the earlier WPPSI, released in the 1960s. The testing was counterbalanced, half of the sample taking the WPPSI first, half taking the WISC-R first. The average WPPSI IQ for these 140 children was 112.8, whereas the same children earned an average WISC-R IQ of about 108.6. Because each new test is calibrated to a general population average of 100, this difference indicates an apparent 4-point gain in the population from the time the WPPSI was standardized (in 1965) to the time the WISC-R was standardized (in 1972). When new revisions are charted against their predecessors in the manner described here, the total apparent gain in mean IQ amounts to about 14 points in the five decades from 1932 to 1981 (Flynn, 1984).
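Spread over the 49 years from 1932 to 1981, the 14-point gain works out to roughly 0.3 IQ points per year. A rough sketch of the correction one might apply to a score obtained from aging norms, assuming that rate holds (an assumption, since the rate varies by era, test, and country):

def norm_inflation(years_since_norming, rate_per_year=0.3):
    # Approximate IQ inflation attributable to obsolete norms,
    # assuming the roughly 0.3-point-per-year rate implied above.
    return rate_per_year * years_since_norming

# A test normed 15 years ago would overstate IQ by about 4.5 points:
print(norm_inflation(15))   # -> 4.5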

This apparent rise in IQ over generations is known as the Flynn effect in honor of the psychologist who first delineated the occurrence (Flynn, 2007a). Although the Flynn effect may have slowed down in recent decades in some countries, it is still found in nearly every comparison of average IQs for successive editions of mainstream intelligence tests. This trend of rising performance has been observed in many nations using other tests as well, including Raven’s Progressive Matrices and the Peabody Picture Vocabulary Test (Daley, Whaley, Sigman, Espinosa, & Neuman, 2003; Nettelbeck & Wilson, 2004).

However, IQ gains of the magnitude observed pose a serious problem of causal explanation. Flynn (1994) is skeptical that any real and meaningful intelligence of a population could vault upward so quickly. He concludes that current tests do not measure intelligence but rather a correlate with a weak causal link to intelligence:

Psychologists should stop saying that IQ tests measure intelligence. They should say that IQ tests measure abstract problem-solving ability (APSA), a term that accurately conveys our ignorance. We know people solve problems on IQ tests; we suspect those problems are so detached, or so abstracted from reality, that the ability to solve them can diverge over time from the real-world problem-solving ability called intelligence; thus far we know little else. (Flynn, 1987)


performance from the 1950s until the 1990s, followed by a reversal and decline. Using Piagetian tests of conservation of weight, volume, and quantity with seventh-grade British schoolchildren, Shayer, Ginsburg, and Coe (2007) documented a steady decline in performance from 1975 to 2003, a phenomenon they dubbed the “anti-Flynn effect.”

Yet, in many countries the Flynn effect continues unabated. Flynn and Rossi-Casé (2012) found large gains on Raven’s Progressive Matrices in Argentina between 1964 and 1998. In South Korea, te Nijenhuis, Cho, Murphy, and Lee (2012) reported large IQ gains as well. The Flynn effect continues to be a puzzling and complex phenomenon.

can be at stake when IQ scores impact capital punishment decisions via the diagnosis of mental retardation (Kanaya, Scullin, & Ceci, 2003).

Several recent studies indicate that the Flynn effect may have abated or even reversed in the beginning of the twenty-first century, at least in some countries. Reviewing data from more than a half-million Danish men over the period 1959 to 2004, Teasdale and Owen (2005) found that average performance on a military entry intelligence test gained slowly, peaked in the late 1990s, and has since declined slowly. Sundet, Barlaug, and Torjussen (2004) found a similar pattern with Norwegian conscripts on a test of matrix reasoning, with improved


  • Chapter 6: Ability Testing: Group Tests and Controversies
    • Topic 6A: Group Tests of Ability and Related Concepts
      • Nature, Promise, and Pitfalls of Group Tests
      • Group Tests of Ability
        • Multidimensional Aptitude Battery-II (MAB-II)
        • A Multilevel Battery: The Cognitive Abilities Test (CogAT)
        • Raven's Progressive Matrices (RPM)
        • Perspective on Culture-Fair Tests
      • Multiple Aptitude Test Batteries
        • The Differential Aptitude Test (DAT)
        • The General Aptitude Test Battery (GATB)
        • The Armed Services Vocational Aptitude Battery (ASVAB)
      • Predicting College Performance
        • The Scholastic Assessment Test (SAT)
        • The American College Test (ACT)
      • Postgraduate Selection Tests
        • Graduate Record Exam (GRE)
        • Medical College Admission Test (MCAT)
        • Law School Admission Test (LSAT)
      • Educational Achievement Tests
        • Iowa Tests of Basic Skills (ITBS)
        • Metropolitan Achievement Test (MAT)
        • Lexile Measures
        • Tests of General Educational Development (GED)
        • Additional Group Standardized Achievement Tests
    • Topic 6B: Test Bias and Other Controversies
      • The Question of Test Bias
        • The Test-Bias Controversy
        • Criteria of Test Bias and Test Fairness
        • The Technical Meaning of Test Bias: A Definition
        • Bias in Content Validity
        • Bias in Predictive or Criterion-Related Validity
        • Bias in Construct Validity
        • Reprise on Test Bias
        • Case Exhibit 6.1: The Impact of Culture on Testing Bias
      • Social Values and Test Fairness
        • Unqualified Individualism
        • Quotas
        • Qualified Individualism
        • Reprise on Test Fairness
      • Genetic and Environmental Determinants of Intelligence
        • Genetic Contributions to Intelligence
        • Environmental Effects: Impoverishment and Enrichment
        • Teratogenic Effects on Intelligence and Development
        • Effects of Environmental Toxins on Intelligence
      • Origins and Trends in Racial IQ Differences
        • Early Studies of African American and White IQ Differences
        • The Genetic Hypothesis for Race Differences in IQ
        • Tenability of the Genetic Hypothesis
        • Recent Trends in Race Differences in IQ
      • Age Changes in Intelligence
        • Early Cross-Sectional Research
        • Sequential Studies of Intelligence
        • Age and the Fluid/Crystallized Distinction
      • Generational Changes in IQ Scores


Journal of Intellectual Disability Research, volume 53, part 1, pp. 44–53, January 2009. doi: 10.1111/j.1365-2788.2008.01121.x

Reliability and validity of the revised Triple C: Checklist of Communicative Competencies for adults with severe and multiple disabilities

T. Iacono,1 D. West,2 K. Bloomberg2 & H. Johnson2

1 Centre for Developmental Disability Health Victoria, Monash University, Melbourne, Victoria, Australia and the Communication Resource Centre, Scope, Melbourne, Victoria, Australia 2 Communication Resource Centre, Scope, Melbourne, Victoria, Australia

Abstract

Aims Few tools are available to assess the communication skills of adults with severe and multiple disabilities functioning at unintentional to early symbolic levels. An exception is the Triple C: Checklist of Communicative Competencies. In this study, aspects of support worker and clinician agreement, internal consistency and construct validity of a revised version of the Triple C were explored.

Method Triple C checklists were completed for 72 adults with severe intellectual disabilities (ID) by 118 support workers and stages were assigned by the researchers. Two support workers completed checklists for each of 68 adults with ID. Three researchers also conducted direct observations of 20 adults with ID.

Results The average support worker agreement for items across the five stages of the Triple C ranged from 81% to 87%; agreement for stage assignment based on first and second support worker checklists was moderate to high (k = .63). Internal consistency was high (KR20 = .97); the stages were found to tap one factor (accounting for ~74% of variance), interpreted to be unintentional to early symbolic communication. Agreements between stages based on researcher observations and support worker-completed checklists were 35% and 71% across first and second support workers.

Conclusion The revised Triple C provides a reliable means of gathering data on which to determine the communication skills of adults with severe and multiple disabilities. The results support a collaborative use of the Triple C, such that a speech-language pathologist or other communication specialist works with a support worker to ensure understanding of the skills observed and development of appropriate intervention strategies.

Correspondence: Associate Professor Teresa Iacono, Centre for Developmental Disability Health Victoria, Monash University, Building 1, 270 Ferntree Gully Road, Notting Hill, Victoria, Australia, 3166 (e-mail: [email protected]).

Keywords communication, assessment, intellectual disability, proxy reports, direct carers

Introduction

For adults with developmental disabilities, access to specialist services, such as speech pathology, can be limited (Stancliffe 2006). In Australia (Bloomberg et al. 2003) and the UK (Money 2002), for example, such limited access seems to be the impetus for training support workers (i.e. paid care staff) in small group residences or day services to implement communication interventions with their clients with developmental disabilities. A consequence has been reliance on these support workers




to provide data needed for communication assessment, which, in turn, can inform service-based interventions.

The extent to which paid care staff can be relied on to provide accurate information has been of some concern. Purcell et al. (1999), for example, found that paid residential and day-placement staff tended to overestimate the language-comprehension abilities of their clients. They also had difficulty with identifying their non-verbal signals. The ability of staff to judge their clients’ abilities has relevance to the use of assessments that rely on proxy reports. Such reporting is used for assessment in a number of areas, such as health (Iacono & Sutherland 2006; Wang et al. 2007), psychopathology (Moss et al. 1998), challenging behaviour (Newton & Sturmey 1991; Harris et al. 1994), quality of life (Campo et al. 1997; Stancliffe 1999) and communication (van der Gaag 1989; Purcell et al. 1999; Iacono et al. 2005).

Investigations into the reliability of proxy reporting by paid care staff have varied in both method and outcomes. Stancliffe (1999), for example, reported a correlation of .59 between scores on the Quality of Life Questionnaire Empowerment factor (measuring choice and control) based on self-reports by adults with intellectual disabilities (ID) and those of their paid carers. The correlation between two paid carers rating the same adult with ID was .73. Harris et al. (1994) compared ratings on two scales of the Checklist of Challenging Behaviour (Aggression and Other Challenging Behaviours) by conducting two interviews with two paid carers for each adult with ID. Agreement between the two checklists ranged from 75% to 77% across the scales for frequency, severity and management difficulty. Moss et al. (1998) provided reliability data on individual items of the Psychological Assessment Schedule for Adults with Developmental Disabilities (PAS-ADD) Checklist on the basis of data provided by key informants, including paid carers. Cohen’s kappa scores for individual item agreement ranged from as low as .30 to .70, but the overall kappa was .79. The authors argued, however, that the reliability measure of most concern was that which determined the extent to which raters’ scores agreed as to whether the individual was above or below the thresholds for possible psychological disturbance, which would trigger

a referral for a complete psychiatric assessment. They reported 79% agreement between raters on whether or not at least one threshold for the three sub-scales had been crossed.
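Cohen's kappa, the index used in several of these studies, corrects raw percentage agreement for the agreement expected by chance alone. A minimal sketch follows, with fabricated ratings from two informants (the stage assignments below are invented for illustration and are not data from any of the cited studies):

from collections import Counter

def cohens_kappa(rater1, rater2):
    # Chance-corrected agreement between two raters.
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (observed - expected) / (1 - expected)

# Fabricated stage assignments by two support workers:
first = [1, 2, 2, 3, 3, 3, 4, 5, 5, 1]
second = [1, 2, 3, 3, 3, 3, 4, 5, 4, 1]
print(round(cohens_kappa(first, second), 2))   # -> 0.74

Here raw agreement is 80%, but because some agreement would occur by chance given the marginal distributions, kappa is lower.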

In terms of judgements about communication behaviours, Iacono et al. (1998) reported difficulties in obtaining agreement for judgements about the intentionality and functions of communication of children with severe and multiple disabilities. Carter & Iacono (2002) found that professionals, including speech pathologists and special education teachers, demonstrated inconsistencies in their judgements about the intentional communication of children with severe and multiple disabilities, children with Down syndrome, and even a child without disability. These difficulties, however, may result from asking observers to make overall judgements, as opposed to reporting behaviours seen, for example, asking if a person has been observed to search for an object that has disappeared from view as opposed to having demonstrated pre-intentional or intentional communication.

An additional complication may be the extent to which the person being asked to rate or make judgements about an individual’s communication regularly interacts with that individual. Granlund & Olsson (1993) compared observational versus teacher interview data on communication behaviours of adults and adolescents with profound ID. The communication behaviours were analysed for communication complexity (i.e. developmental level of the form used, such as pointing and looking vs. pointing only) and frequency. They obtained higher correlations when comparing interview data obtained from the same teacher who interacted in videos (used to code observational data) across the functions of social interaction, joint attention and behavioural regulation (r = .60, .83, .60, respectively) than when a different teacher participated in the two conditions (r = .50, .32, .30, respectively).

van der Gaag (1989) argued that direct care staff of adults with ID have a wealth of knowledge and experience about the individuals in their care. On a premise of equal status between professionals and carers, she developed the Communication Assessment Profile (CASP). In the first part of the CASP a carer reports on the communication functions displayed by an individual with ID, and the




everyday situations in which that individual participates. The second part of the CASP is completed by a speech pathologist, based on observation of the individual. The third part, which is jointly completed by the speech pathologist and carer, is used for intervention planning. van der Gaag reported on the reliability of carer ratings by comparing those of a person with ID’s key support worker with those of a support worker who had regular contact with, but no particular responsibility for, that person. An overall correlation between these support worker ratings was .72; agreements between support worker ratings and speech pathologists’ observations (Parts 1 and 2 of the CASP) ranged from 50% to 85%, while agreement between speech pathologists ranged from 81% to 99%.

Purcell et al. (1999) further explored the reliability of paid direct staff reports on the CASP by comparing their ratings of adults with ID with assessments conducted by speech pathologists. In addition, the support workers rated the adults’ non-verbal signals before and after a 15-min interaction session, and the communication behaviours of both the support workers and their clients during a natural interaction were analysed by the researchers. Support workers volunteered for the study and selected a client with whom they were comfortable and felt they knew well. Agreement between their ratings of non-verbal behaviours from before versus after the interaction session ranged from 53% to 71% for non-verbal communication and from 42% to 79% for communicative functions. Agreements between support worker and speech pathologists’ ratings ranged from as low as 37% to 95% for non-verbal behaviours and from 58% to 92% for communicative functions. Comparisons of support worker ratings and researcher counts based on observations indicated that the support workers had particular difficulty in rating non-verbal signals and the function of commenting.

Despite the poor results for both intra- and inter-rater reliability for direct care staff data, arguments have been made for including support workers as key players in the communication assessment and intervention of their clients (e.g. van der Gaag 1989; Bloomberg et al. 2003). According to van der Gaag (1989), support workers can provide information about their client's communication behaviours and skills that is not readily available to a professional. In particular, professionals have limited opportunities for interaction with and observation of a client; in contrast, support workers experience prolonged engagement with clients across daily activities.

Further argument for involving support workers comes from a movement towards providing tertiary-level communication interventions with a focus on enhancing functional outcomes within everyday situations (Granlund et al. 1995). Such approaches must involve those who interact with people with disabilities on a daily basis, in particular family and/or direct paid staff. The variable data on the reliability of communication information obtained from paid carers suggest a need to provide them with sufficient information to enable them to understand and recognise the behaviours being assessed (Purcell et al. 1999). In addition, Iacono et al. (2005) suggested that participating in an assessment process in collaboration with a professional could, in itself, act to increase the carer's knowledge of and sensitivity to potentially communicative behaviours. Difficulty with recognising communication appears to be of particular concern for adults with severe communication impairment who do not demonstrate spoken or even symbolic skills. For these individuals, both observation and recognition of non-verbal behaviours as possibly communicative are needed to obtain comprehensive information about their communication skills and potential (Iacono et al. 1998; Purcell et al. 1999; Carter & Iacono 2002).

Unfortunately, there are few assessment protocols developed specifically for adults with ID, particularly protocols that tap unintentional, or intentional but non-symbolic, communication behaviours (Iacono & Caithness in press), which characterise many individuals at severe to profound levels of ID (McLean et al. 1999). An exception is the Triple C: Checklist of Communicative Competencies, designed by Bloomberg & West (1999) to determine the communication skills of adolescents and adults with severe and/or multiple disabilities. The target group is individuals who are unintentional or early intentional/symbolic communicators; hence they are functioning well below the level at which other communication assessments begin (e.g. Dewart & Summers undated). The Triple C is administered by support workers and others who are familiar with the individual. A retrospective study of 172 completed Triple C checklists by Iacono et al. (2005) demonstrated the tool's internal consistency and its underlying single-factor structure. However, a problem was evident with Stage 1, which had internal consistency that, although adequate (KR-20 = .77), was lower than that obtained for the other five stages (KR-20 of .85 or above). On the basis of these results, the Triple C was revised by: (1) collapsing the original Stage 1 and 2 items (with some deletions); (2) requiring a response for all items on the checklist by checking boxes to indicate observed or not observed (as opposed to providing a check mark only if observed); and (3) changing some terminology and examples on the basis of feedback from support workers and clinicians. The inter-rater reliability of the Triple C was not determined in that first study, and with the changes made, there was also a need to address the construct validity and internal consistency of the revised tool.

Aim

The aim of the current study was to determine support worker agreement, internal consistency and underlying constructs for the revised version of the Triple C, and the extent to which data obtained from the tool corresponded with direct observations by clinicians.

Method

Ethics

Approval for the conduct of the study was obtained from two formally constituted Human Research Ethics Committees and a disability organisation's ethics committee. Proxy consent was obtained from the next of kin or guardians of the adults with ID, given that the inclusion criteria precluded these adults' providing their own consent. Direct consent was obtained for support worker participation.

Participants

Adults with intellectual disability

The criteria for inclusion of adults with ID were that: (1) they were not linguistic (i.e. vocabulary was less than 50 words, and any word combinations were rote productions); and (2) consent had been received for information about them to be included in the study. Seventy-two adults with ID (44 men, 28 women) participated, aged from 20 to 70 years (mean = 41, mode = 36). All had ID (nine had Down syndrome). Additional reported disabilities were Autism Spectrum Disorder (n = 14) and cerebral palsy (n = 18). Ten participants had reported uncorrected vision impairment, four had reported hearing impairment, and 21 were non-ambulatory. Twenty of the adults (15 men, 5 women) also participated in direct observations. They were aged 20–70 years (mean = 37 years): one had Down syndrome, and in addition to other forms of ID, five were reported to have Autism Spectrum Disorder and four to have cerebral palsy.

Support workers

The criteria for the inclusion of support workers were that they: (1) had not had training or experience in using the Triple C; and (2) had worked for at least 6 months with an adult with ID for whom consent had been received. The support workers came from accommodation and day services for adolescents and adults with severe and/or multiple disabilities in both rural and metropolitan regions across Victoria, Australia. A total of 118 support workers consented to participate in the study (89 women, 29 men). They had been support workers for a mean of 9 years (range = less than 1 year to 33 years) and had worked with their target adult with ID for a mean of 4 years (range = less than a year to 15 years). The highest level of education was Year 9, 10 or 11 of high school for 33 support workers, high school (i.e. Year 12) for 26, Tertiary and Further Education for 37, and university for 12 (10 participants did not provide this information).

Most support workers (n = 76) completed checklists for only one target adult with ID, but 25 completed them for two adults, five for three adults and two for four adults.

Revised Triple C checklist

The revised checklist comprises five stages that reflect the continuum from unintentional to symbolic communication. These stages are described, with sample items, in the Appendix.

Procedures

Each support worker attended a 2–3-h training session on how to complete a Triple C checklist, conducted by one of the researchers. In the training, videos and case examples of adults with ID were used to demonstrate skills within each stage. At the beginning of the session, the research was introduced, and explanatory statements and consent forms were distributed. The support workers were also asked to complete a questionnaire that provided background information. During this session, support worker pairs were identified for assignment to the same adult with disability. After the session, support workers were asked to complete the checklist on that adult without discussion with the other support worker also completing a checklist for the same adult. Completing the checklist required that, for each item within each stage, the support worker indicated whether the behaviour had been 'observed' or 'not observed'. At the end of 2 weeks, they submitted the checklist to the researchers, who then assigned a code and de-identified it prior to data entry.

Completed Triple C checklists were used to assign a stage according to Bloomberg & West (1999), using a consensus process: two of the researchers jointly viewed each completed checklist and agreed on a stage.

Twenty adults from two participating disability services who had consented to a second stage of the study were observed individually by the researchers working in pairs (all experienced speech-language pathologists). Observations occurred in the adult's home or day service during routines, including meals, programmed activities (e.g. music programme, cooking and art) and leisure, over a 2–3-h period. Notes on salient communication behaviours were made. Based on these observations, an estimate was made of the person's highest level of communication by using the Triple C stages, through a consensus process between the two observing researchers.

Results

Support worker agreement

Support workers were assigned as either first or second observers for the purpose of data analysis. Those support workers who completed checklists for more than one target adult were alternately assigned as first or second observers across the adults. Checklists were obtained from two support workers for 68 of the participating target adults. Agreement for each item was determined by calculating, across adults with ID, the number of agreements divided by the number of agreements plus disagreements, multiplied by 100 to yield percentage scores. These ranged from 63% to 95% (mean = 85%) for Stage 1/2, from 70% to 94% (mean = 81%) for Stage 3, from 73% to 95% (mean = 83%) for Stage 4, from 69% to 92% (mean = 84%) for Stage 5, and from 76% to 97% (mean = 87%) for Stage 6.
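As an illustration of this calculation, the following is a minimal Python sketch of the item-level agreement statistic for a pair of raters; the response matrices, item count and disagreement rate are hypothetical, not the study data.

import numpy as np

# Hypothetical 'observed' (1) / 'not observed' (0) responses for 68 adults
# on 15 checklist items, from two independent raters.
rng = np.random.default_rng(0)
n_adults, n_items = 68, 15
rater1 = rng.integers(0, 2, size=(n_adults, n_items))
rater2 = rater1.copy()
flip = rng.random((n_adults, n_items)) < 0.15   # ~15% simulated disagreement
rater2[flip] = 1 - rater2[flip]

# For each item: agreements / (agreements + disagreements) x 100,
# computed across the adults, as described in the text.
item_agreement = (rater1 == rater2).mean(axis=0) * 100
print(f"min {item_agreement.min():.0f}%, max {item_agreement.max():.0f}%, "
      f"mean {item_agreement.mean():.0f}%")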

The researcher consensus coding indicated that for four pairs of checklists, a stage could not be assigned for at least one of the checklists because the data failed to reveal an interpretable pattern. These pairs were deleted from the analysis to determine agreement for stage assignment. The number of completed checklists assigned to each stage across the two support workers is shown in Table 1. Agreement between researcher stage assignments based on checklists completed by the first versus the second support worker was determined using Cohen's kappa, which yielded a moderate to high coefficient: k = .63 (P < .001). Table 1 also indicates the relatively even spread of checklists assigned across the five stages for both first and second support worker data.
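Cohen's kappa corrects raw agreement on stage assignment for the agreement expected by chance from each rater's marginal stage frequencies. A minimal sketch, using hypothetical stage labels rather than the study data:

from collections import Counter

def cohens_kappa(r1, r2):
    # Observed proportion of matching assignments
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from the two raters' marginal frequencies
    c1, c2 = Counter(r1), Counter(r2)
    p_chance = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

sw1 = ["1/2", "3", "3", "4", "5", "5", "6", "4"]   # hypothetical rater 1
sw2 = ["1/2", "3", "4", "4", "5", "6", "6", "4"]   # hypothetical rater 2
print(f"kappa = {cohens_kappa(sw1, sw2):.2f}")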

Table 1 Number of participants assigned to each of the Triple C stages based on support worker 1 and support worker 2 checklists

Stage    Support worker 1    Support worker 2
1/2      10 (16)             12 (19)
3        14 (22)             12 (19)
4        12 (19)             16 (25)
5        16 (25)             13 (20)
6        12 (19)             11 (17)

Figures in brackets are percentages based on n = 64.

Table 2 Factor analysis communality scores and factor loadings for support worker 1 and 2 data

         Support worker 1           Support worker 2
Stage    Communality  Loading      Communality  Loading
1/2      0.39         0.62         0.59         0.77
3        0.75         0.87         0.72         0.85
4        0.98         0.99         0.87         0.94
5        0.78         0.89         0.78         0.88
6        0.50         0.71         0.41         0.64


To determine the internal consistency and construct validity of the revised Triple C, analyses were repeated for two sets of data with unreliable data removed, that is, data obtained from checklists completed by the first (n = 68) and second (n = 65) support workers for which a stage could be assigned. Internal consistency was determined using KR-20 (the equivalent of Cronbach's alpha for dichotomous items). For the first observer data, the overall KR-20 was .97, with values for each stage ranging from .88 to .93. Similar results were obtained for the second observer data: the overall KR-20 was .97, with values for each stage ranging from .80 to .97.
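KR-20 is Cronbach's alpha specialised to dichotomous (0/1) items such as the observed/not observed checkmarks used here. A minimal sketch with simulated data; the 30-item response matrix below is hypothetical, not the study data.

import numpy as np

def kr20(responses):
    # responses: (adults x items) array of 0/1 checkmarks
    k = responses.shape[1]
    item_var = responses.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = responses.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(size=(68, 1))              # latent level per adult
difficulty = rng.normal(size=(1, 30))           # 30 illustrative items
responses = (ability > difficulty).astype(int)  # 0/1 checklist matrix
print(f"KR-20 = {kr20(responses):.2f}")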

Construct validity was determined by using stage totals. Preliminary analysis indicated the data from both observers were suitable for factor analysis: Kaiser-Meyer-Olkin (KMO) sampling adequacy was above .70, and Bartlett's Test of Sphericity was significant, indicating sufficient correlation among the stage totals for factoring. Data from both data sets were found to be non-normally distributed (Kolmogorov-Smirnov, P < .001); hence the extraction method chosen for factor analysis was Principal Axis Factoring, as recommended by Fabrigar et al. (1999). For each data set, the analysis yielded a one-factor solution (eigenvalue = 3.7) that accounted for 74% and 73% of the variance respectively. Factor loadings and communality scores for each of the stages are presented in Table 2, which indicates that Stages 3 to 5 were the strongest indicators of the factor, while Stages 1/2 and 6 were moderate indicators.
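Principal Axis Factoring, unlike maximum-likelihood extraction, makes no normality assumption: it factors the correlation matrix with iteratively re-estimated communalities on the diagonal. The following one-factor sketch is illustrative only; the 68 x 5 matrix of stage totals is simulated, not the study data.

import numpy as np

def paf_one_factor(X, n_iter=100, tol=1e-6):
    R = np.corrcoef(X, rowvar=False)
    # Initial communalities: squared multiple correlations
    h2 = 1 - 1 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)          # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rr)   # eigenvalues in ascending order
        loading = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))
        delta = np.max(np.abs(loading ** 2 - h2))
        h2 = loading ** 2                 # updated communalities
        if delta < tol:
            break
    return np.abs(loading), h2, vals[-1]

rng = np.random.default_rng(2)
factor = rng.normal(size=(68, 1))                # one latent factor
weights = rng.uniform(0.6, 1.0, size=(1, 5))     # 5 stage totals
stage_totals = factor @ weights + rng.normal(scale=0.5, size=(68, 5))
loadings, communalities, eigenvalue = paf_one_factor(stage_totals)
print("loadings:", loadings.round(2))
print("eigenvalue:", round(eigenvalue, 2),
      "- variance accounted for:", round(eigenvalue / 5 * 100), "%")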

Agreement was also determined between the stages derived from Triple C checklists completed by support workers and the stages determined by researcher observations for the 20 adults with ID. This comparison was conducted for both first and second support worker data. Point-by-point agreement with the first support worker data was only 35%, and that with the second support worker data was 71%. Correlations between the data sets were high for both the first and the second support worker data (r = .79 and .94 respectively), reflecting the fact that differences were never more than one stage.
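This pattern, in which exact agreement is modest while the correlation remains high, simply reflects assignments that are consistently within one stage of each other, as the following sketch with hypothetical stage codes illustrates:

import numpy as np

# Hypothetical stage codes (1-5) that never differ by more than one stage
researcher = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 3])
supportwkr = np.array([1, 2, 3, 3, 4, 4, 5, 5, 4, 3])

exact = (researcher == supportwkr).mean() * 100   # point-by-point agreement
r = np.corrcoef(researcher, supportwkr)[0, 1]     # Pearson correlation
print(f"exact agreement: {exact:.0f}%, Pearson r = {r:.2f}")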

Discussion

The results of the present study indicated overall high levels of agreement between two support workers about the behaviours of the same individual, that is, for each item within each stage of the checklist. The lowest agreement score for individual items was 63%, with stage averages never falling below 81%. Direct comparison with previous studies is difficult in that they addressed behaviours other than communication (e.g. Harris et al. 1994; Moss et al. 1998) or, in the case of van der Gaag (1989), communicative functions; nonetheless, they provide some means of comparison for observations made by direct care staff. In examining inter-rater reliability of the PAS-ADD checklist for detecting psychiatric disorder, Moss et al. (1998) reported individual item agreement in the form of Cohen's kappa, with the overall score being quite low at .42 and individual items ranging from .30 to .80. van der Gaag (1989) did not report point-by-point agreement, but rather a correlation of .72 between support worker ratings of communication functions, which, she argued, was indicative of a high rate of agreement.

There are a number of possible reasons for the relatively high levels of agreement across support workers found in the present study. First, they received training in the use of the Triple C. This training included information on the stages of the Triple C and working through each item across stages, including the provision of examples through cases presented descriptively and on video. Other researchers reporting on the ability of direct care staff to provide communication information, such as van der Gaag (1989) and Purcell et al. (1999), did not report the provision of training. Moss et al. (1998) suggested that their own poor inter-rater agreement resulted from using untrained raters, as well as from failing to provide a guide to rating in the form of a glossary. Some form of training would seem essential given that direct care staff come from diverse backgrounds, often with limited education. Certainly, many of the support workers in the present study had not completed high school, with few reporting tertiary-level education. Hence, such direct care staff may be naïve about the behaviours or skills about which they are being asked to report. On the other hand, as argued by van der Gaag (1989), direct care staff develop an expertise about their individual clients as a result of working closely with them on a daily basis and, often, for prolonged periods. This type of expertise may be readily enhanced if the individual is provided the opportunity to consider the person he or she is supporting during structured training, and if this training targets the required task, such as completion of an assessment checklist.

A second reason for the relatively high levels of agreement on checklist items obtained in the current study may lie in the nature of the task: support workers were asked to report on specific behaviours, rather than to make a judgement about how behaviours relate to particular stages of communication. Such judgements would seem more the purview of communication specialists, such as speech-language pathologists, given that pre-symbolic communication skills, particularly in adults with significant and often multiple disabilities, are difficult to identify even for professionals (Carter & Iacono 2002). On the other hand, the information used for such judgements requires observation of an individual across situations and over time, as occurs with other assessment tools developed for individuals unable to self-report or participate in structured tasks (e.g. Moss et al. 1998). The Triple C, therefore, was designed to facilitate the gathering of data by a paid or family carer that can then be used by a speech pathologist or other communication specialist to make judgements about a person's level of communication. The PAS-ADD checklist was developed on a similar basis, albeit in relation to behaviours that could signal psychopathology. As a result, Moss et al. (1998) argued that item agreement was far less crucial than agreement between raters as to which individuals had scores above the threshold indicative of psychiatric disorder. In the current study, we were concerned with agreement between the two raters in terms of stage assignment, based on completed checklists and determined through a consensus process by two of the researchers. We obtained a moderate to high level of agreement for stage assignment across the two support workers, indicating acceptable agreement for the judgements that are based on the support worker data. It should be noted, however, that a few support workers did not provide reliable data. This lack of reliability was evident in checklists that failed to provide a cohesive picture of the adult's skills (e.g. checking a few items from all stages).

The two sets of data from first and second support workers, with unreliable checklists removed, also yielded similar results in terms of internal consistency and construct validity, as would be expected given the agreement data. Hence, the revised Triple C was found to retain the high level of internal consistency of the original. In addition, the revised Triple C still tapped one underlying construct of early communication, but the results for construct validity were clearer than those obtained for the original version. On the basis of analysis of retrospective data, Iacono et al. (2005) found that the original version tapped only one interpretable factor, argued to be pre-intentional to pre-symbolic ability. That analysis also showed an additional factor onto which Stage 1 loaded highly, although this stage also loaded moderately onto the first factor. In that study, it was argued that the second factor tapped by the Triple C may have been more interpretable if the proportion of participants within each stage had been better distributed, given that the retrospective data clustered mostly at Stages 3 and 4.

In fact, in the current study, it was found that for the revised version there was no indication of a second factor, with all stages loading moderately to highly onto the one factor. The difference is logically the result of deleting many of the original Stage 1 items and combining the rest with Stage 2 to form a new Stage 1/2, thereby possibly eliminating those items that tapped any other factors. In addition, the much smaller sample size of the current study (68 used in the factor analysis for first support worker data) in comparison with that of the original study (n = 172), as well as the more even distribution of communication stages represented (ranging from 16% to 25% across the stages for first and second observer data), may have contributed to this different finding. Given that the factor loadings for all stages, including Stage 6, were moderate to high (above .60), the interpretation of this factor needs to extend from pre-symbolic to symbolic. Stage 6 behaviours reflect established symbolic ability as demonstrated by the use of symbolic forms (spoken, signed or pictured words) (see the Appendix).

A concern with the factor analyses conducted for the present study was the small sample size; convention would suggest the need for much larger samples. According to Stevens (1996), however, a factor with four or more components loading at .60 or above can be considered reliable, even with small sample sizes. Given that the loadings for all five stages were above .60, given the use of an extraction method that took the non-normal distribution into account, and given the similarity of findings across the two support worker data sets (expected in light of the agreement scores), support for the finding of an underlying single factor would seem strong.

The question of agreement between stage assignment based on direct researcher observations and that based on support worker completed checklists was addressed in an attempt to approximate concurrent validity. The results were disappointing, especially for the first support worker data, and reflect previous findings of poor agreement between speech pathology and support worker observations (e.g. van der Gaag 1989; Purcell et al. 1999). The high correlations, however, indicate that there was never more than one stage difference between the two data sources. It is tempting to suggest that judgements by people who could be considered experts, by way of professional training and experience, are likely to be more accurate than those of untrained carers. However, these judgements were based on relatively brief observations. Also, although the observations were conducted in the person's daily settings, there may have been limited opportunity to observe a range of communication behaviours. There was, therefore, no strategy for determining the accuracy of the observation-based judgements beyond observing in pairs and reaching consensus. The challenge in any attempt to determine the concurrent validity of the Triple C is the absence of any formal communication assessment appropriate for people with early communication skills that could provide a basis for comparison. As an example, van der Gaag (1989) and Purcell et al. (1999) used the CASP (Purcell et al. 1999), a tool that assumes at least intentional communication.

Clinical implications

The clinical implications of this study are that, if used as designed, with collaboration between paid carers and speech-language pathologists, the revised Triple C can be used with confidence. This conclusion is made with the caveat that its primary purpose is to provide a means of assessing the communication skills of adults with severe communication impairment who are pre-symbolic or have limited symbolic ability. Functionally, these are individuals who have limited speech skills and who may struggle to convey their basic needs and wants, let alone more complex messages such as physical or emotional states. Furthermore, the Triple C was designed as a means of sensitising support staff to clients' potentially communicative behaviours, which typically may have gone unrecognised. In a context in which a speech-language pathologist has an opportunity to discuss items and any inconsistencies with a support worker, completion of the Triple C is likely to assist in identifying the communication strengths of these adults.

After completion of the Triple C for a person with ID, the next step is to identify strategies that best support interactions in light of his or her stage of communication. Unfortunately, there is limited guidance for developing such interventions for people whose skills may fall somewhere along the unintentional to early symbolic continuum. However, a recently published manual by Bloomberg et al. (2004) provides a resource for mapping strategies to Triple C communication stages.

Research implications

Further research into the concurrent validity of the revised Triple C, with a larger sample, is warranted. Given the lack of assessments appropriate for adults at unintentional to early symbolic communication levels, comparisons may need again to be based on direct observations. A combination of prolonged observation and structured sampling (such as that used by McLean et al. 1999) may prove to be the only option. Research is also needed on the extent to which completion of the Triple C does in fact assist support workers and family carers to become more aware of the communication ability of persons with ID, and more able to respond appropriately to their communication attempts.

Acknowledgements

Thanks are extended to the clients and staff of the services that participated in this project.

References

Bloomberg K. & West D. (1999) The Triple C – Checklist of Communicative Competencies. Scope, Melbourne, Vic.

Bloomberg K., West D. & Iacono T. (2003) PICTURE IT: an evaluation of a training program for carers of adults with severe and multiple disabilities. Journal of Intellectual and Developmental Disability 28, 260–82.

Bloomberg K., West D. & Johnson H. (2004) InterAACtion: Strategies for Intentional and Unintentional Communicators. Scope, Melbourne, Vic.

Campo S., Sharpton W., Thomson B. & Sexton D. (1997) Correlates of quality of life of adults with severe or profound mental retardation. Mental Retardation 35, 329–37.

Carter M. & Iacono T. (2002) Professional judgments of the intentionality of communicative acts. Augmentative and Alternative Communication 18, 177–91.

Dewart H. & Summers S. (undated) The Pragmatics Profile of Everyday Communication Skills in Adults. Available at: http://wwwedit.wmin.ac.uk/psychology/pp/ (retrieved 3 April 2007).

Fabrigar L. R., Wegener D. T., MacCallum R. C. & Strahan E. J. (1999) Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods 4, 272–99.

van der Gaag A. (1989) Joint assessment of communication skills: formalising the role of the carer. The British Journal of Mental Subnormality 35, 22–8.

Granlund M. & Olsson C. (1993) Investigating communicative functions in profoundly retarded persons: a comparison of two methods of obtaining information about communicative behaviours. Mental Handicap Research 6, 112–30.

Granlund M., Bjorck-Akesson E., Brodin J. & Olsson C. (1995) Communication intervention for persons with profound disabilities: a Swedish perspective. Augmentative and Alternative Communication 11, 49–59.

Harris P., Humphreys J. & Thomson G. (1994) A checklist of challenging behaviour: the development of a survey instrument. Mental Handicap Research 7, 118–33.

Iacono T. & Caithness T. (in press) Assessment issues. In: Autism Spectrum Disorders and AAC (eds P. Mirenda & T. Iacono), pp. 23–48. Paul H. Brookes, Baltimore, MD.

Iacono T. & Sutherland G. (2006) Health screening and developmental disability. Journal of Policy and Practice in Intellectual Disabilities 3, 155–63.

Iacono T., Carter M. & Hook J. (1998) Identification of intentional communication in students with severe multiple disabilities. Augmentative and Alternative Communication 14, 102–14.

Iacono T., Bloomberg K. & West D. (2005) A preliminary investigation into the internal consistency and construct validity of the Triple C: Checklist of Communicative Competencies. Journal of Intellectual and Developmental Disability 30, 139–45.

McLean L. K., Brady N. C., McLean J. E. & Behrens G. A. (1999) Communication forms and functions of children and adults with severe mental retardation in community and institutional settings. Journal of Speech, Language & Hearing Research 42, 231.

Money D. (2002) Speech and language therapy management models. In: Management of Communication Needs in People with Learning Disability (eds S. Abudarham & A. Hurd), pp. 82–102. Whurr, London.

Moss S., Prosser H., Costello A., Simpson N., Patel P., Rowe S. et al. (1998) Reliability and validity of the PAS-ADD checklist for detecting psychiatric disorders in adults with intellectual disability. Journal of Intellectual Disability Research 42, 173–83.

Newton J. & Sturmey P. (1991) The Motivation Assessment Scale: inter-rater reliability and internal consistency in a British sample. Journal of Mental Deficiency Research 35, 472–4.

Purcell M., Morris I. & McConkey R. (1999) Staff perceptions of the communicative competence of adult persons with intellectual disabilities. The British Journal of Developmental Disabilities 88, 16–25.

Stancliffe R. (1999) Proxy respondents and the reliability of the Quality Of Life Questionnaire Empowerment factor. Journal of Intellectual Disability Research 43, 185–93.

Stancliffe R. (2006) Services, availability, issues and models. In: Community Disability Services: An Evidence-Based Approach to Practice (eds I. Dempsey & K. Nankervis), pp. 272–95. UNSW Press, Sydney.

Stevens J. (1996) Applied Multivariate Statistics for the Social Sciences, 3rd edn. Lawrence Erlbaum Associates, Mahwah, NJ.

Wang K.-Y., Hsieh K., Heller T., Davidson P. & Janicki M. (2007) Carer reports of the health status among adults with intellectual/developmental disabilities in Taiwan living at home and in institutions. Journal of Intellectual Disability Research 51, 173–83.

Accepted 28 July 2008

Appendix

Revised Triple C stage descriptions

Unintentional passive
Description: Behaviours produced in response to internal and external stimuli are assigned intent or meaning by a communication partner.
Example items: Shows an awareness of sounds, particularly voices. Visually follows slowly moving objects or people.

Unintentional active
Description: Beginning attempts to act purposefully on objects, with behaviours assigned intention or meaning by a communication partner.
Example items: Reaches or moves towards familiar people in familiar situations. Reaches for or looks at an object to indicate preference/choice.

Intentional informal
Description: Acting on the environment to create a specific effect, resulting in communication attempts through informal rather than symbolic means.
Example items: Imitates novel behaviours. Uses people to get objects.

Symbolic (basic)
Description: Integration of information from each of the senses, trial and error to solve simple problems, and use of conventionally understood symbols within limited contexts.
Example items: Gives or shows an object to a person to obtain an action. Follows a simple instruction out of routine.

Symbolic (established)
Description: Solving of problems through thinking about them; the person has internal representations and can use symbols in a range of contexts.
Example items: Predicts cause/effect relationships. Uses photos, pictures or signs for choice making.
