THE EFFECTIVENESS OF RATER TRAINING IN IMPROVING THE SELF-ASSESSMENT INTRA-RATER RELIABILITY OF ENGLISH SPEAKING PERFORMANCE Nur Rini Abstract This quantitative study is an attempt to estimate the intra-rater reliability of student self-assessment of their speaking performances and to find out whether there is significant difference between the self-assessment intra-rater reliability of speaking performance without training and that of with training. The rater training used is adapted from the model developed by Herman, Aschbacher and Winters (1992). This study which employed equivalent time- samples design collected data by asking 45 students to conduct self- assessment on their six speaking performances. It was found that the range of rs was 0.611 to 0.752 which means the consistency within students in assessing their own speaking performance was moderate high to high. The intra-rater reliability of the self assessment after the treatments is higher than that of other experience being available in the absence of the treatment. It is concluded that rater training improves the intra-rater reliability. Therefore, it is suggested to train the students on how to assess before employing self- assessment in speaking instructions. Key words: self-assessment, intra-rater reliability, rater training, speaking Introduction Self-assessment is appraisal by a student of his or her own work or learning processes (O'Malley jand Valdez Pierce 1996; Sawyer, Watson, and Adams 1989). This study defines self-assessment as an activity in which the student is asked to describe his or her speaking performance by filling the self-assessment form which is designed by the teacher and the students. To estimate the intra-rater reliability of the self-assessment the student is asked to self-assess every of his or her own speaking performance twice. The first self-assessment is done directly after the speaking task is completed and the second is done the following day while he or she is watching the video recording of his or her speaking performance. In this study, the intra-rater reliability means the consistency within the students in self-assessing their English speaking performance. It is assumed 8 8 that rater training is needed to improve the reliability. Rater training is structured activities that this study believes may help students do self-assessment on their speaking performance. According to Brown (2004:251), self-assessment is one of the forms of alternatives in assessment. It is to collect additional measures of students in an effort to triangulate data about students in minimizing the weaknesses of standardized tests. Self assessment offers certain benefits: direct involvement of students in their own destiny, the encouragement of autonomy, and increased motivation because of their self- involvement. For this reason the students are to have self-assessment skill. Furthermore, self-assessment is providing opportunity for students to do reflecting of their learning (O'Malley and Pierce 1996; Johnston 1987; Carroll and Hall 1985). Self- assessment is the basis for setting the individual learning goal (O'Malley and Pierce 1996). Since setting the learning goal is based on the self-assessment, the intra-rater reliability is required. If the students are not able to do self-assessment well, the learning goal will not be set appropriately. It may lead to the wrong learning direction. The need for improving the intra-rater reliability in doing self-assessment is crucial. Regardless of the advantages of employing student self-assessment, many teachers do not yet feel comfortable with it. In fact, "teachers do not believe in giving up this much control to students, whom they do not believe to be capable of self-assessment" (O'Malley and Valdez Pierce, 1996:36). They are concerned much with subjectivity, as not only professional teachers find difficulties in assessing productive language skills like speaking but even so the students. Speaking is "a skill which deserves attention every bit as much as literary skills" (Bygate 1987:vii). Students may be either underestimate or overestimate themselves, or they may not have the necessary tools to make an accurate assessment. Furthermore, "especially in the case of direct assessments of performance, they may not be able to discern their own errors" (Brown, H.D. 2004:270). In contrast, Bailey (1998) cited in Brown, H.D. (2004:270) conducted a study in which learners showed moderately high correlation (between .58 and .64) between self rated oral production ability and scores on the OPI. It tells us that learners' self-assessments may be more accurate than we might suppose. 8 9 There are some objections, which are usually due to the technical difficulties, that teachers do not want to employ self-assessment in speaking instructions. However, a recent study done by Liu Qin and Wang Li (2008) on a portfolio approach - which is self- assessment is considered as a crucial part of it - to oral English assessment found that it was workable and could yield reliable results to assess students' oral English proficiency. Moreover, the students participating in the experimental group showed that they were very happy to be able to reflect upon and assess their own progress and had more confidence in improving oral English proficiency. Hadley and Mort (1999), de Wet, der Walt and Niesler (2009) assume that there is a tight correlation between scoring system and rater reliability and they suggest that the scoring bands and their meaning need to be made explicit. Two observations on the use of self-assessment have been done, i.e. that on writing by Hall and that on oral English by Burn (both in Johnston, 1987: 125). The studies suggested teacher to let students explore the elements used for building good presentations and make the elements as the criteria for assessing their own work. The process will bring the students to understand the notion of setting their own learning goals. The study on inter-rater reliability done by Bailey (1998. cited in Brown, H.D. 2004:270) showed moderately high correlation (between .58 and .64) between self rated oral production ability and scores on the OPI. Meanwhile, the studies on the intra-rater reliability have rarely been conducted. Therefore, the study on improving the intra-rater reliability in the self-assessment of speaking performance through rater-training is worth conducted. "Intra-rater reliability is the consistency within raters (Bachman, 2004: 169)." In test scores that are obtained subjectively such as ratings of oral presentations, a source of error is inconsistency in these ratings. In the case of single rater, the concern with the consistency within that individual's ratings or with intra-rater reliability is required. Further description about intra-rater reliability is quoted from Bachman (1990:178-180): When an individual judges or rates the adequacy of a given sample of language performance, whether it is written or spoken, that judgment will be based on a set of criteria of what constitutes an 'adequate' performance. If the rater applies the same set of criteria consistently in rating the language performance of different individuals this will yield a reliable set of ratings. 9 0 In addition, Bachman (1990:178-180) recommends two ways to measure the intra-rater reliability of ratings, using Spearman rank-order coefficient or coefficient alpha. ...To examine the intra-rater reliability of ratings, at least two independent ratings from the rater for each individual language sample are needed. This is typically accomplished by rating the individual samples once and then re-rating them at a later time (in different, random order). Once the two sets of ratings have been obtained, the reliability between them can be estimated in two ways. One way is to treat the two sets of ratings as scores from parallel tests and compute the Spearman rank-order coefficient between the two sets of ratings, interpreting this as an estimate of reliability. ... The intra-rater reliability of the self-assessment of English speaking performance can be estimated if there is a students group of speaking instructions employs self- assessment. Conducting self-assessment on speaking performance calls for speaking rating scale. It is suggested that the elements included in the rating scale are the results of teacher-student agreement (O'Malley and Pierce, 1996). The purposes of the study are to measure the intra-rater reliability of the self- assessment of English speaking performance; and to find out whether there is significant difference between the self-assessment intra-rater reliability of speaking performance without training and that of with training. Method of Investigation The study consisted of three stages: (1) questionnaire survey on oral rating scale to set the rating scale; (2) measuring intra-rater reliability of the self-assessment of English speaking performance; and (3) finding the impact of rater training on the intra- rater reliability. A questionnaire survey was conducted to collect the students' opinions on actual rating scales (Please refer to Appendix I for the questionnaire.). It goes with Aschbacher's argument that "One of the characteristics of performance assessment is that the criteria are made public and known in advance" (1991, in O'Malley and Pierce, 1996: 9 1 03 Self-assessment of speaking performance 3 (direct & recorded): Telling someone's personality. It was conducted in week 9. Xj Repeated rater training 1 was delivered in week 11. 04 Self-assessment of speaking performance 4 (direct & recorded): Telling my job preferences. It was done in week 12. Xo Other experience being available in the absence of the treatment 05 Self-assessment of speaking performance 5 (direct & recorded): Text retelling. It was accomplished in week 14. Xi Repeated rater training 2 was through in week 16. 06 Self-assessment of speaking performance 6 (direct & recorded): Describing a country. It was carried out in week 17. As it is stated by Tuckman (1978:139) that "it is a form of time-series design but, rather than introducing the treatment (Xj) only a single time, it is introduced and reintroduced, with some other experience (Xo) being available in the absence of the treatment." The subjects of the study were 45 first-year students of D3 Program of Business Administration Department of Semarang State Polytechnic, academic year 2008/2009. The number has met the minimum requirement for doing the correlation analysis as it goes with what Mantra (in Singarimbun, M. and Effendi S. (Eds). 1989) states that if correlation technique is used in the analysis the minimal sample of 30 should be fulfilled. There were three instruments to conduct self-assessment and collect the data: (1) Questionnaire on oral rating scales which was used to set the rating scale to assess students' speaking performance (See Appendix 1); (2) Six speaking tasks that were adapted from The New Interchange Book 2 (Richards, Hull, and Proctor 1997): They were used to conduct the 3-minutes oral presentation; and (3) self-assessment form and the scoring guide (See Appendix 2 and 3). If the value of rs reaches 0.7, it can be concluded that the intra-reliability of self- assessment is considered high. It goes with Lado's statement (1961) cited in Hughes (2003: 39) that "oral production test may be in .70 to .79 range." He adds that "a 9 3 reliability coefficient of .85 might be considered high for an oral production test but low for a reading test." A comparison of the average of 0\t O3, and O5 with the average of O2, O^and will yield a result that is not likely to be invalidated by historical bias (Tuckman, 1978:140). The assumption is if the average of O2, O4, 0$ is higher than the average of 0\t Osf O5,the rater training likely improves the intra-reliability of the self-assessment. Table 1 The Analysis Design to Find out the Impact of the Rater Training on the Intra-rater Reliability of the Self-assessment of Speaking Performance First Second Third Administration Administration Administration Xj 02 04 o6 Xo Oj 03 o5 To guard the external validity, which is one weakness of employing the design, the study held three administrations instead of having two administrations. If the effect of Xj of the third administration is the same as its effect when introduced and reintroduced, then it would make valid conclusions about the continuous of Xi from a study using the equivalent time-samples design (Tuckman 1978). The rater training used was adapted from the model developed by Herman, Aschbacher and Winters (1992 as cited in O'Malley and Valdez Pierce, 1996). There are five phases described below. (1) Orientation to the assessment task • introducing the purposes of the assessment; • describing who will use the assessment results; • discussing the objective being assessed; • describing the prompts and student directions; • giving an overview of the scoring rubric; and • taking the assessment themselves so they understand the mental processes that are being called on as they take the assessment. (2) Clarification of the scoring rubric • Discussing the scoring rubric and its components in small groups; • Thinking back on the mental processes that are called for in responding to the prompt and how the rubric taps into these processes; and 9 4 • Reviewing the anchor performances (representative products or performances used to characterize each point on a scoring rubric or scale. (3) Practice scoring • Scoring a speaking performance in groups and individually and discussing the rates; • Taking notes while scoring, providing reasons why they assigned the scores; and • Attempting for establishing consensus in their ratings. (4) Check reliability: Comparing the students awarded scores with the teacher awarded scores to check the reliability. (5) Follow-up: The students were invited to do reflection on how they assessed the recorded speaking performances, to find out whether they overestimate or underestimate or they give about the same score awarded by the teacher. The limitation of the study is that the student-teacher and student-peers discussions on the recorded speaking performance before the students did the second self- assessment might cause bias in this study. Findings The Rating Scale for the Self-assessment of Speaking Performance The analysis of the core component of the questionnaire showed the first seven items among nine elements which have the highest mean scores. They were taken to put in the rating scale. They are grammar, pronunciation, loudness, vocabulary, body language, task and fluency. Two elements - cohesion and strategy - were excluded as they had the lowest means, 2.87500 and 3.06250 respectively. From the class discussion with the students, it was revealed that they felt that the two elements were too difficult to understand. There were only few answers to the open-ended question, and the answers were the clarification of the seven elements. Many previous studies did not include the element of loudness. Loudness is the volume of the speaker's voice. This study considers the element is important to improve since the fact that not only most female but also male subjects of the study do not speak loudly when they do presentation in front of the class, even the teacher who sits about one meter from the student can hardly catch the students' words. 9 5 A class discussion was held to assign the weight on every element. Table 2 shows the result of the analysis on the questionnaire responses and the class discussion. Table 2. The Core Component of the Questionnaire and the Weight of the Element ITEM ELEMENT MEAN NOTES WEIGHT 1 Task 3.56250 Chosen 2 2 Pronunciation 4.35417 Chosen 2 3 Vocabulary 3.93750 Chosen 1 4 Grammar 4.43750 Chosen 2 5 Fluency 3.33333 Chosen 1 6 Cohesion 2.87500 Not chosen - 7 Strategy 3.06250 Not chosen - 8 Body Language 3.72917 Chosen 1 9 Loudness 4.06250 Chosen 1 With reference to the analysis result, the seven elements were used in the self- assessment (Please refer to Appendix 2: Self-assessment Form). Intra-rater Reliability The scores awarded by the students were entered into the Excel worksheet and calculated to get the final scores. Based on the agreement on the given weight (please see Table 2) the equation to calculate the final score is (2XTask score + 2XPronunciation score + Vocabulary score + 2XGrammar score + Fluency score + Body Language score + Loudness score)/10. For example, if a student circles the highlighted numbers shown in the following table, the final score is (2X4+2X5+4+2X3+6+7)/10=41/10=4.1. The final scores were used as the data to observe the intra-rater reliability. Table 3. Example of Student's Awarded Score Item Statement not very well —->very well Notes 9 6 1 I can complete the task 1 2 3 I 5 6 7 2 I can pronounce the words 1 2 3 4 I 6 7 3 I use appropriate vocabularies 1 2 3 1 5 6 7 4 I use appropriate grammar 1 2 1 4 5 6 7 5 I keep my presentation fluent 1 2 3 4 1 6 7 6 I use eye contact, facial 1 2 3 4 5 I 7 expression, and gestures to help convey my ideas. 7 I can make my presentation 1 2 3 4 5 6 I audible. The results of the six observations - employing Spearman rank-order coefficient correlation - of the values of the intra-rater reliability of the self-assessment of the six speaking performances are presented in the Table 4. Table 4. The Intra-rater Reliability of the Self-assessment of the Speaking Performances Observation The Availability of the Treatment rs Observation 1 No rater training .611 Observation 2 Rater training .634 Observation 3 No rater training .652 Observation 4 Rater training .658 Observation 5 No rater training .682 Observation 6 Rater training .752 First observation was done to seek the intra-rater reliability of the self-assessment of the first speaking performances. The self-assessments were done on other experience being available in the absence of the treatment. The result articulates that the intra-rater reliability of the self-assessment of the first speaking performance is 0.611. Second 9 7 observation was completed to measure the intra-rater reliability of the self-assessment of the second speaking performance. They were done after the treatment. The calculation displays that the intra-rater reliability of the self-assessment of the second speaking performance is 0.634. So were third, fourth, fifth and sixth observations done to seek the intra-rater reliability of the self-assessment of the third, fourth, fifth and sixth speaking performances respectively. The findings show that the intra-rater reliability of the self- assessment of the third, fourth, fifth and sixth speaking performance are 0.652, 0.658, 0.682, and 0.752 respectively. Impact of Rater-training on the Intra-rater Reliability To find out whether there is significant difference between the self-assessment intra-rater reliability of speaking performance without training and that of with training, it is needed to compare the average of three Spearman coefficient correlations (rs) on other experience being available in the absence of the treatment and the average of three Spearman coefficient correlations (rs) on the treatment. If the average of three Spearman coefficient correlations (rs) on the treatment is higher than the average of three Spearman coefficient correlations (rs) on other experience being available in the absence of the treatment, it can be said that the treatment has impact on the intra-rater reliability of the self-assessment of speaking performance. The following table shows the comparison of the average of Oit 03, and O5 with the average of O2,0