LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

 
LLT Journal: A Journal on Language and Language Learning 

 http://e-journal.usd.ac.id/index.php/LLT 

Sanata Dharma University, Yogyakarta, Indonesia 
 

126 
 

RATER AGREEMENT AND DISAGREEMENT  

IN THE MEASUREMENT OF ENGLISH ARTICLE ACQUISITION 

SUPPLIANCE AND ACCURACY 

 
Rose Acen Upor 

University of Dar es Salaam 

upor@udsm.ac.tz 

correspondence: upor@udsm.ac.tz 

https://doi.org/10.24071/llt.v24i1.2603 

received 19 May 2020; accepted 25 February 2021 

 
Abstract  

This study combines language assessment processes and interlanguage analysis 

techniques to determine rater agreement and disagreement in assessing English 

article acquisition. Employing native English speaking and non-native English 

speaking raters, picture sequence narratives that were written by English as a 

Foreign Language (EFL) learners (n=97) were coded and scored for suppliance-

in-obligatory context (SOC) and target-like utterance (TLU). Although the kappa 

statistic revealed a fair agreement between raters (0.17 – 0.33), content analysis 

methods revealed much higher agreement (88.29% - 94.07%). Furthermore, 

language background effects between the raters could not be substantiated 

however the results demonstrated a discernable disagreement pattern between 

them. Thus, the study recommends the inclusion of a foreign language teaching 

background as a factor for rater selection to minimize language background 

effects on rating language assessments. 

 
Keywords: Article acquisition, Inter-rater agreement, Inter-rater disagreement, 

Language background effects 

Introduction 

Although the general relationship between language assessment and second 

language acquisition is relatively well established, the association with foreign 

language learning situations such as in Africa has not been clearly understood. 

Despite, the wide acknowledgment of the multidimensional research in language 

assessment studies, appraisal of foreign language learning situations has not been 

fully explored. Most studies of inter-rater reliability (IRR) on language 

assessment focus on tests of English proficiency and issues of rater assessment. 
Some of the issues identified include rater bias, rater background, rater 

severity/leniency and formats of testing. Other aspects include methodology, rater 

sample, and rater agreement, to mention a few. In some studies, rater bias has 

been shown to impact the results of proficiency tests in particular rater language 

background and rater severity (Caban, 2003; Johnson & Lim, 2009; Kim, 2009). 

In other studies, possible effects of rater training on levels of inter-rater agreement 

mailto:upor@udsm.ac.tz
mailto:upor@udsm.ac.tz
https://doi.org/10.24071/llt.v24i1.2603


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

127 

 
and rater severity were noted (Elder, Barkhuizen, Knoch, & von Randow, 2007; 

Elder, Knoch, Barkhuizen, & von Randow, 2005; Knoch, Read, & von Randow, 

2007; O’Sullivan & Rignall, 2007). Inter-rater reliability measures have also been 

used in studies that are not necessarily dependent on samples from language 

proficiency testing (Stolarova, Wolf, Rinker & Brielmann, 2014).  This paper 

intends to explore and bridge foreign language learning research and language 

assessment methods through measurement of suppliance and accuracy in article 

acquisition as part of a methodology in inter-rater agreement. The aim of the study 

is two-fold; first, it addresses the inter-rater reliability measures of the ability of 

learners to supply articles and determine the accuracy of these forms, second it 

determines inter-rater agreement and disagreement effects on article suppliance.  

In addressing the two aims of the study, this article is divided into 2 major 

sections. First, it builds on the existing body of research on the acquisition of 

English articles by adopting the Bickerton/Huebner model in determining the 

constructs for the rating scale (Bickerton, 1981; Huebner, 1983) and interlanguage 

analysis techniques in the collection of performance data (Pica, 1983). On one 

hand, the Bickerton/Huebner model is built on a taxonomy in the study of article 

use and it considers semantic and discourse-pragmatic features of the noun phrase 

(NP). According to the model, English NPs are classified based on referentiality 

i.e. specific reference [±SR] and hearer knowledge [±HK]. This allows for a 

comprehensive study of article use in four contexts namely, general reference 

(type 1), referential definite (type 2), indefinite reference (type3) and non-

referential (type 4) (Bickerton, 1981; Huebner, 1983). This framework made it 

possible to differentiate the underlying uses of the English article system in 

narratives and set a rating scale. On the other hand, the interlanguage analysis 

techniques adopted from Pica (1983) intend to provide statistical support in 

determining the instances of suppliance and accuracy of article use by EFL 

participants in the study. The Suppliance-in-Obligatory Contexts (SOC) and 

Target-Like-Utterance (TLU) measures provide a basis for the raters to determine 

the obligatory contexts for suppliance and accuracy of the English articles. Norris 

and Ortega (1983) indicate that these measures reveal differential patterns in 

learner types that would have gone undetected. They claim that naturalistic 

learners and instruction-only learners tend to have a smaller expressive 

vocabulary than instruction-plus-exposure learners. This illustrates that these 

measures have an increased sensitivity of analytical units and procedures that may 

contribute to a better understanding within a given theory. Second, the study also 

builds on the constructs of rater assessment so as to determine rater agreement and 

disagreement. To do so, the study uses the assessment data from the raters to 

perform statistical tests to determine the rate of agreement and disagreement. 
Through the findings, the paper shall explore minimally two constructs of 

language assessment, namely, rater language background influence and rater bias. 

These constructs are associated with the analysis based on the non-native and 

native English speaking raters involvement in the study.  

Hence, to expound on the relationship between language assessment and 

foreign language learning, and in particular, assessment of article suppliance and 

accuracy in narratives, the present study measured rater agreement and 

disagreement with a set of measures that span SLA and language assessment 

procedures. The findings of the study shall contribute to both the body of 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

128 
 

knowledge in language assessment and foreign language learning by providing 

insight into open-ended language assessment and the role of foreign language 

teaching experience in rater criteria selection.  

 
Acquisition of articles 

It is a commonly discovered fact that EFL/ESL learners face difficulties in 

acquiring the English article system. Different reasons cited for these difficulties 

include the complexities of the English articles themselves (Celce-Murcia & 

Larsen-Freeman, 1999), the lack of an equivalent article system in the learner’s 

native language (Mizuno, 2000) and a lack of effective teaching methods in 

English education (Yamada, 1982). Studies in the acquisition of English articles 

have approached from various viewpoints; the viewpoints of grammar (Yamada 

1982; Lyons 1999), of usage (Dilin & Gleason, 2002), of context (Huebner, 1985; 

Parrish, 1987; Ionin, Ko & Wexler, 2004) and a typology of nouns preceding 

articles (Chierchia, 1998; Ogawa, 2008).  

Evidence has shown that second language (L2) learners of English often 

have persistent difficulty in the use of articles until very late stages of acquisition 

or do not ever reach native-like levels of performance (Zdorenko & Paradis, 

2008), even when there is increased time in instruction (Master, 1987; Ogawa, 

2008). Some studies that have included comparisons of L2 learners from first 

language (L1) backgrounds with and without article systems suggest that L1 

transfer most likely plays a role in the L2 learners’ acquisition of English articles 

(Master, 1987; Murphy, 1997; Wakabayashi, 1997; Trademan, 2002; Hawkins, 

Al-Eid, Almahboob, Athanasopoulos, Chaengchenkit, Hu, Rezai, Jaensch, Jeon, 

Leung, Matsunaga, Ortega, Sarko, Snape, & Velasco-Zarate, 2006). Findings by 

Master (1987) indicate that there are variations that are considered in cases where 

L1s differ among subjects. However, the zero article (henceforth referred to as 

zero, Ø) dominates, which indicates that it is acquired first. Although the definite 

article, the, emerges early, there was evidence to indicate the-flooding in all 

environments.  It is also noted that [-ART] learners delay in the acquisition of a 

when compared with the. With the acknowledgment of variation in learners from 

different L1 backgrounds, the argument in the case was whether there was a role 

played by the L1 transfer and whether the learners fluctuated in article parameter 

setting.  Zdorenko and Paradis (2008) in their study of 17 ESL children 

discovered that the children substituted the definite article for the indefinite a in 

indefinite specific contexts regardless of the L1 background. Moreover, the 

children were more accurate in the use of the definite article in definite-specific 

contexts. The opposite was discovered by Jaensch (2008) who found that learners 

did not fluctuate between definiteness and specificity, although group 

comparisons proved that learners with higher proficiency outperformed learners 

with lower proficiency. Kaku (2006) brings forth an impelling perspective to 

article use. In his study of Japanese learner’s use of the, he discovered that the 

definite article is associated referentiality and with Japanese being a [-ART] 

language, he noticed that learners were reassembling the newly acquired feature in 

relation with their current use of the Japanese demonstratives for specificity. In 

terms of using SOC and TLU measures, Lu (2001) investigated the accuracy rate 

and the order of acquisition and observed a different order of emergence of the 

articles the>a>zero. Differentiation of orders could be attributed to the instruction, 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

129 

 
length of exposure, the participants themselves and/or the nature of the research 

tasks. Even where there were varied tasks performed by a group of learners, the 

results still yielded a systematic order of acquisition; however, the accuracy rate 

of the results was in question. The SOC measure is considered the most reliable 

index for accuracy levels (Lu, 2001).  

 
Inter-rater reliability tests in article acquisition 

Several studies have explored rater variability in both oral and written ESL 

performance assessment. Some of these studies focused on different rater 

backgrounds (Barnwell, 1989; Brown, 1995; Chalhoub-Deville, 1995; Chalhoub-

Deville & Wigglesworth, 2005; Fayer & Krasinski, 1987; Galloway, 1980; 

Hadden, 1991), others studied rater severity (Barnwell, 1989; Caban, 2003; Fayer 

& Kransinski, 1987; Johnson & Lim, 2009; Kim, 2009), while others focused on 

rater decision-making strategies (Barkaoui, 2010; Crisp, 2008; Cumming, 1990; 

Cumming, Kantor, & Powers, 2002; Huot, 1993; Lumley, 2005; Milanovic, 

Saville, & Shuhong, 1996; Sakyi, 2000; Vaughan, 1991), and others on the 

interaction between rater and criteria (Knoch et al., 2007; McNamara, 1996; 

Schaefer, 2008; Wigglesworth, 1993). A common thread among all these studies 

was the use of standardized language performance assessment as the basis of their 

investigation. A study by Richard Nickalls at the University of Birmingham 

employed four raters in determining the inter-rater reliability testing of article 

error tags by checking the extent raters would reliably classify article use as 

‘correct’ or ‘incorrect’ and if the correctness is consistently classified over time. 

The study used the Bickerton/Huebner Model and the raters received identical 

training. First, the raters tagged noun phrases for correctness using the online 

interface and three weeks later, the researchers tagged the same noun phrases 

again for correctness using the Bickerton/Heubner framework. The findings 

indicated that human raters were more reliable than automated computer methods. 

However, in terms of the Bickerton/Heubner framework, the findings showed that 

the raters could not use the framework consistently.  Nickalls (2013) argues that 

raters cannot apply classification frameworks, in which the decision goes beyond 

a rater’s dichotomous intuition especially in this case where they could not make 

reliable choices between generic, indefinite, non-referential and idiomatic 

contexts. 

It also needs to be pointed out that rater background has been shown to 

impact the results of language proficiency in test-takers. Studies of raters with 

diverse backgrounds, both linguistic and professional have been conducted. Some 

studies focused on rater severity based on rater background (Brown, 1995; 

Chalhoub-Deville, 1995), others on raters’ professional background (Hadden, 
1991) and linguistic background (Fayer & Kransinski, 1987; Kim, 2009). 

Findings from these various studies indicate that teachers and non-native speakers 

tend to be more severe in their assessments (Brown, 1995; Chalhoub-Deville, 

1995), teachers tend to be more severe than non-teachers (Hadden, 1991) and non-

native raters tend to be more severe (Fayer & Kransinski, 1987). Discrepant 

findings from Chalhoub-Deville (1995) and Brown (1995) indicate that teachers 

who participated in their studies were attendant to creativity and adequacy of 

information in a narration task and, there was no significant difference between 

the rating done by NS and NNS, respectively. Johnson and Lim (2009) have 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

130 
 

identified variables that could attribute to rater language background effects and 

intervene with the analysis when it comes to issues of NS and NNS raters. These 

issues included language distance affecting language performance (Elder & 

Davies, 1998); NS taking a more intuitive approach in rating (Brown, 1995), use 

of trained/untrained raters and different rating scales. These discrepancies call for 

further research into the area. 

 
Method 

Research questions 

 This present study will use data collected from Tanzanian EFL learners who 

were enrolled in 3different levels of education. The data were scored by 2 raters 

who possessed different language backgrounds. The study addressed the 

following research questions: 

a. Is there variability in the suppliance and accuracy of the English article 

acquisition among the EFL learners? 

b. To what extent will the raters agree in rating the article suppliance and 

accuracy? 

c. Is there an identifiable pattern to rater disagreement?  

 If there is an identifiable pattern to rater disagreement, can an argument be 

made regarding the language background of the raters? 

  
Participants 

 A total of 97 Tanzanian EFL learners participated in this study, 30 primary 

(elementary) school pupils (hereafter referred to as children), 30 secondary (high) 

school students (hereafter referred to as teenagers) and 19 students in their first 

year at University and 18 in their final year of university education. The 

elementary level students were enrolled in a public primary school in the outskirts 

of the city of Dar es Salaam. These are children who had at least 5 – 7 years of 

learning English as a subject, with all other subjects being taught in Swahili. The 

secondary school students were also enrolled in a public school; however, it is at 

this level of education that the medium of instruction shifts to all subjects being 

taught in English with Swahili as a subject. All university courses are taught in 

English with an exception for the Swahili language courses. 

 
Table 1. Descriptive characteristics of the study sample 

Characteristics N % 

Participants   

Children 30 30.9 

Teenagers 30 30.9 

First year 19 19.5 

Final Year 18 18.5 

Gender   

Total 97 100 

Male 50 51.5 

Female 47 48.5 

Mean Years of 

learning English 

  
Children 8.67 n.a. 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

131 

 
Characteristics N % 

Teenagers 9.14 n.a. 

First Year 11.82 n.a. 

Final Year 13.95 n.a. 

Number of 

languages spoken 

  
Two 67 69.1 

Three 27 27.8 

Four + 3 3.1 

First language   

Swahili 83 85.6 

Other  14 14.4 

 
The raters 

The participants’ narratives were scored by two raters. Both raters were trained in 

using SOC and TLU scoring methods. The rating scale was determined by the 

researchers following the Bickerton/Huebner model. Both raters were experienced 

instructors of English as a Foreign Language and had taught English to NNS 

through formal classroom instruction in environments where learners had limited 

language resources from which they could do language practice. Below is a 

profile of the raters: 

 
Table 2. Descriptive Characteristics of the Raters 

Characteristics Rater 1 Rater 2 

Language experience   

L1 Swahili English 

L2 English Vietnamese 

Other languages spoken Luo and Jita (rudimentary) Russian 

English language proficiency   

 NNS NS 

 Native-like proficiency Native speaker 

Gender   

 Female Female 

Professional experience   

Teaching 21 years 26 

Research 17 years 20 

 
Methodology 

Most studies on the acquisition articles have made use of language 

proficiency ascription for groups (Huebner, 1983; Jaensch, 2008; Kaku, 2006; Lu 

2001; Ogawa, 2008; Tarone 1985; Zdorenko & Paradis 2008;); however, in this 

study levels of proficiency were not considered instead the groups were identified 

and ascribed based on the level of schooling. Due to distinct characteristics in the 

larger adult group (university students), this group was split into two smaller 

groups; first year students and seniors. All of the participants were asked to write 

out a narrative from a text with picture sequences (See Appendix A). Different 

picture sequences for data collection were used in the study, however, it should be 

noted that variation in narratives does not affect the results or findings of a study 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

132 
 

(Ayoun & Salaberry, 2008). Each group of respondents was given different 

picture sequences for narration based on content, the number of years spent 

learning English and the difference in levels of education.  

 
Rating scale and data analysis procedures  

The picture sequences were designed to elicit narrative passages from the 

study participants. First, the researchers agreed on a protocol of their analysis 

before coding the data. They made use of suppliance-in-obligatory context (SOC) 

and target-like-use (TLU) measures. The first procedure, SOC is a method used to 

determine accurate suppliance of morphemes in linguistic environments in which 

the morphemes are required in Standard English. The basis for this analysis is 

that, if a participant produces an utterance such as ‘I have few books’, this speaker 

creates an obligatory context for use of the plural –s inflection. The reason behind 

this being that the participants appear to have acquired the rule of production of 

the morpheme, but have simply applied this rule to an exception (Pica 1983, Gass 

and Selinker 2001). This quantification method is represented in the following 

formula: 

 
SOC = number of correct suppliance x 2 + number of misformations 

Total obligatory contexts x 2 

 
In the second procedure, TLU is used to determine accurate use and 

distributional patterns for morphemes. This analysis was developed in light of the 

criticism that SOC analysis does not account for the over suppliance of a 

particular morpheme in inappropriate contexts (Pica 1983, Gass and Selinker 

2001). The method is represented as follows; 

 
TLU = number of correct suppliance in obligatory contexts 

Number of obligatory contexts + number of suppliance in nonobligatory contexts 

 
Analysis by SOC reveals how well participants had learned to produce a 

morpheme where it is required while analysis by TLU reveals how well 

participants have learned to control the production of that morpheme about where 

it is and is not required (Pica 1983). The results from the SOC and TLU were 

computed into percentages. To determine the interactions between the factors as 

well as individual factors, statistical procedures were performed on the data. 

These methods of morpheme quantification were adopted to demonstrate the 

ability of EFL learners in using articles as they write narratives. The following 

definitions of constituents in the measures were as follows; 

 
Correct suppliance: When the participants provide the correct form of the item in 

such a way that it does not make a construction ungrammatical 

Obligatory context: When the participants create a context of the use of an item in 

such a way that without it the construction is deemed ungrammatical and with it, 

the construction is deemed grammatical 

Misformation: When the participants provide an incorrect item in the context of a 

correct item in such a way that it deems the construction ungrammatical 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

133 

 
Non-Obligatory Context: When the participants provide an item in a context in 

which it was not required or not created for its inclusion 

 
After the defining constituents in the SOC and TLU, a rating scale was 

established for articles based on the types of forms and their functions in Standard 

English. The rating scale is as follows: 

  
Rating for Articles 

Step 1:  General or Specific to Specific 

  Does the narrative make use of articles in a general way? 

If yes → the beginning of the narrative will use ‘a/an’ and then move towards 

specific ‘the’. 

If no → the narrative will maintain the specific form ‘the’ from start to end, 
using the narratives to provide prior context for a specific reference. 

 
Step 2:  Naming 

 Do any of the narratives use the naming of characters? 

If yes → No article should appear before the noun form referring to the characters, 
which should be capitalized.   

If no → refer back to step 1. 
 

The scale was to be used as the researchers identified the SOC and TLU 

scores of the narratives. The analysis was conducted as follows: 1) the researchers 

independently reviewed and coded the written narratives to identify articles 

produced in each context as either correct suppliance, misformation, non-

obligatory context, and obligatory context, and; 2) the scores that the researchers 

awarded the SOC and the TLU were then entered into SPSS for further analysis 

 
Findings and Discussion  

Suppliance and accuracy of articles 

A one-way analysis of variance (ANOVA) was conducted on the scores of 

the groups' SOC and TLU to evaluate the relationship between the ability to 

supply the forms in the study and the accuracy of this suppliance within the 

different groups. A statistically significant difference was found among the four 

levels of EFL learner groups on the average SOC for articles (F (3, 93) = 18.80, p 

= .000) and on the average TLU for articles (F (3, 93) = 15.72, p = .000).  

 
Table 3. ANOVA Table for the SOC and TLU for Articles 
Items Sum of 

Squares 

df Mean 

Square 

F Sig. 

SOC Between Groups 16371.643 3 5457.214 18.798 

 
.000* 

 
Within Groups 26998.401 93 290.305 

Total 

 
43370.044 96 

TLU Between Groups 17888.655 3 5962.885 15.719 

 
.000* 

 
Within Groups 35277.842 93 379.332 

 Total 53166.497 96 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

134 
 

Due to the number of groups, a posthoc test was performed to uncover 

specific differences between the group means using the average SOC and TLU 

scores. The Games Howell test reveals that the four groups differed significantly 

in their ability in their suppliance and accuracy of articles. There was a significant 

difference in the suppliance of articles between the children group (p = <0.5) and 

the teenage group however there was no significant difference between the 

children group and the adult groupings. This limited variability between the 

children and adult groupings could be attributed to the length of the narratives and 

the number of correct formations. Although the children’s narratives were shorter, 

the magnitude of correct formations, misformation, and obligatory contexts was 

much similar to the adult groupings. Likewise, there were also significant 

differences between 1st-year students, teenagers, and final year students. In the 

accuracy of the articles, the test results indicated that the only group that was 

statistically significant from the rest of the groups was the teenage group 

(p=<0.5). This significance is important because it was within this group that both 

raters experienced very short narratives, high instances of naming and inconsistent 

use of capitalization compared to the other groups, therefore, proving a challenge 

to the raters. Furthermore, it is the same group that was consistently outperformed 

by the other groups in terms of both suppliance and target-like use of articles. The 

other group that has also shown to be significantly different based on this test is 

the final year adult group (p=<0.5). This group has illustrated a significant 

difference from the other groups in terms of the average identifying of contexts of 

use of articles. Table 4 illustrates the results of the Games-Howell tests on the 

groups’ average TLU and SOC. 

 
Table 4. Games-Howell Test of the Average SOC and TLU of Articles 
Dependent 

Variable 

(I) Age 

Groups 

(J) Age 

Groups 

Mean 

Difference 

(I-J) 

Std. 

Error 

Sig. 95% Confidence 

Interval 

Lower 

Bound 

Upper 

Bound 

Average 

SOC for 

Articles 

Children Teens 27.29166* 4.99206 .000* 13.9603 40.623

0 

1st Year  7.35311 4.31444 .338 -4.3316 19.037

8 

Final 

Year  

-5.63480 2.81565 .203 -13.1424 1.8728 

Teens Children -27.29166* 4.99206 .000* -40.6230 -

13.960

3 

1st Year  -19.93855* 5.72807 .006* -35.1957 -4.6814 

Final 

Year  

-32.92646* 4.70365 .000* -45.5936 -

20.259

3 

1st Year  Children -7.35311 4.31444 .338 -19.0378 4.3316 

Teens 19.93855* 5.72807 .006* 4.6814 35.195

7 

Final 

Year  

-12.98791* 3.97718 .016* -23.9380 -2.0378 

Final 

Year  

Children 5.63480 2.81565 .203 -1.8728 13.142

4 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

135 

 
Dependent 

Variable 

(I) Age 

Groups 

(J) Age 

Groups 

Mean 

Difference 

(I-J) 

Std. 

Error 

Sig. 95% Confidence 

Interval 

Lower 

Bound 

Upper 

Bound 

Teens 32.92646* 4.70365 .000* 20.2593 45.593

6 

1st Year  

 
12.98791* 3.97718 .016* 2.0378 23.938

0 

Average 

TLU for 

Articles 

Children Teens 24.81160* 5.43016 .000* 10.4023 39.220

9 

1st Year  6.52689 5.45956 .634 -8.1981 21.251

8 

Final 

Year  

-12.59477* 4.36680 .030* -24.2578 -.9317 

Teens Children -24.81160* 5.43016 .000* -39.2209 -

10.402

3 

1st Year  -18.28471* 6.28700 .028* -35.0673 -1.5022 

4th Year  -37.40637* 5.36548 .000* -51.7139 -

23.098

9 

1st Year  Children -6.52689 5.45956 .634 -21.2518 8.1981 

Teens 18.28471* 6.28700 .028* 1.5022 35.067

3 

Final 

Year  

-19.12166* 5.39524 .007* -33.7553 -4.4880 

Final 

Year s 

Children 12.59477* 4.36680 .030* .9317 24.257

8 

Teens 37.40637* 5.36548 .000* 23.0989 51.713

9 

1st Year  19.12166* 5.39524 .007* 4.4880 33.755

3 

* The mean difference is significant at the .05 level. 

Inter-rater Agreement 

Three separate tests were involved in determining the rate of agreement and 

disagreement between the two raters i.e. Cohen’s kappa, Holsti’s content analysis, 

and Scott’s pi. Cohen’s kappa statistic is frequently used to measure the 

agreement between two raters. The cross-tabulation between the rating of 

suppliance and accuracy of articles shows that there is an agreement between the 

two raters. The symmetric measures table shows that Kappa for each level of 

rating between the raters indicates fair agreement for correct formations (.29), 

misformations (.30) and non-obligatory contexts (.33) and slight agreement (0.17) 

for obligatory contexts as shown in Table 6. 

 
Table 6: Symmetric Measures of Cohen’s Kappa between the two raters 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

136 
 

a.   

Not assuming the null hypothesis. 

b.   Using the asymptotic standard error assuming the null hypothesis. 

 
These results indicate a large amount of disagreement than expected 

between the raters. In as much as the kappa is used to measure inter-rater 

agreement, its strength lies in the fact a study has collected correct representations 

of the variables measured (McHugh, 2012). A probable explanation for this low 

agreement could be a symmetrical imbalance between the two raters. However, 

the kappa statistic is also known to have its limitations. The terms symmetrical, 

asymmetrical, imbalance, prevalence, and bias have been used to describe the 

limitations associated with the statistic (Flight & Julious, 2015). The most 

probable explanation for low kappa in the context of the study would be the 

problem of oversuppliance errors as predicted by Pica (1983) which point towards 

prevalence in this case. Moreover, Feinsten and Cicchetti (1990) highlight what 

they refer to as ‘paradoxes’ of the kappa. They indicated that asymmetric, 

imperfectly imbalanced tables have higher kappa than perfectly imbalanced 

symmetric tables. Also where there were high values of agreement, lower values 

of kappa were recorded. Based on this observation, we could predict that because 

of the low kappa recorded, probable high values shall be recorded in through other 

indices. Most of the studies that have recorded limitations in the kappa statistic are 

health-related studies (Flight & Julious, 2015; McHugh, 2012; Tang, Hu, Zhang, 

Wu, & He, 2015).  

Although a Prevalence and Bias Adjusted Kappa (PABAK) is proposed to 

overcome the limitations of the kappa statistic (Byrt, 1993), this study chose to 

use the content analysis method proposed by Holsti (1969). The two-stage process 

was chosen: first, to determine the degree of token-based agreement among the 

raters and second, to determine the degree of agreement through traditional 

inferential statistics. The first part of the analysis contains a count of the tokens of 

articles between the two raters for the participants and use Holsti’s method (1969) 

for determining the agreement. The method is a variation of percentage 

agreement, a measure that is popular and easy to understand and calculate, yet it 

can be applied to more than two coders (Lombard et al., 2002), unlike for Holsti’s 

method that is limited to two coders as evidenced in its formula. 

 
Item Value 

Asymp. 

Std. Errora Approx. Tb 

Approx. 

Sig. 

Correct formations .293 .048 15.067 .000* 

Misformations .300 .061 6.403 .000* 

Obligatory contexts .170 .041 9.503 .000* 

Non-obligatory 

contexts 
.330 .086 4.320 .000* 

N of Valid cases 97       


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

137 

 
Coefficient of Reliability = 

2M 

M is the number of judgments on 

which both of the coders agree 

N1 + N2 N1 and N2 are the total number 

of judgments made by both 

coders 

Source: Holsti, O. R. (1969). Content analysis for the social sciences and 

humanities, pp140 

 
Table 7 presents the description of the results of the narratives, showing total 

use (number of tokens) and percentage usage by the group and by the rater. Table 

6 is followed by Table 8 that summarizes the information from Table 7. 

 
Table 7. Step by Step descriptives and coefficients of reliability by group and rater 
Group Rating items Rater1 Rater2 Agreement 2M N1 + 

N2 

C.R. 

(%) 

Children Suppliance-in-

obligatory 

context 

Corr 197 215 194 388 412 94 

Mis 25 35 18 36 60 60 

Oblig 229 264 223 446 493 90 

Total 451 514 435 870 965 90 

Target-like use Corr 197 215 194 388 412 94 

Oblig 229 264 223 446 493 90 

Non 11 17 9 18 28 64 

Total 437 496 426 852 933 91 

Teens Suppliance-in-

obligatory 

context 

Corr 245 270 236 472 515 92 

Mis 80 56 50 100 136 74 

Oblig 500 464 426 852 964 88 

Total 825 790 712 1424 1615 88 

Target-like use Corr 245 270 236 472 515 92 

Oblig 500 464 426 852 964 88 

Non 9 17 2 4 26 15 

Total 754 751 664 1328 1505 88 

First Year 

Students 

Suppliance-in-

obligatory 

context 

Corr 392 415 378 756 807 94 

Mis 64 66 45 90 130 69 

Oblig 517 532 492 984 1049 94 

Total 973 1013 915 1830 1986 92 

Target-like use Corr 392 415 378 756 807 94 

Oblig 517 532 492 984 1049 94 

Non 17 22 2 4 39 10 

Total 926 969 872 1744 1895 92 

Final Year 

Students 

Suppliance-in-

obligatory 

context 

Corr 607 643 602 1204 1250 96 

Mis 36 39 31 62 75 83 

Oblig 653 715 652 1304 1368 95 

Total 1296 1397 1285 2570 2693 95 

Target-like use Corr 607 643 602 1204 1250 96 

Oblig 653 715 652 1304 1368 95 

Non 0 5 0 0 5 0 

Total 1260 1363 1254 2508 2623 96% 

 
LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

138 
 

Table 8. Summary of Descriptives and Coefficients of Reliability 
 Suppliance in Obligatory Context 

(SOC) 

Target Like Utterance (TLU) 

Correct Mis Oblig Total Correct Oblig Non Total 

Rater 1 1441 205 1899 3545 1441 1899 37 3377 

Rater 2 1543 196 1937 3676 1543 1937 61 3541 

Agreement 1410 144 1831 3385 1410 1831 13 3254 

2M 2820 288 3662 6770 2820 3662 26 6508 

N1 + N2 2984 401 3836 7221 2984 3836 98 6918 

C.R. (%) 94.50 71.82 95.46 93.75 94.50 95.46 26.53 94.07 

KEY: 

  N1 Count of instances by rater 1 

  N2 Count of instances by rater 2 

2M Expected total IFF the raters agreed on all instances/twice 

the agreement count 

  C.R Coefficient of Reliability 

 
In summation, the coefficients used to calculate inter-rater reliability were 

reported in most of the articles (94.07%, n=97). Rater agreement in the suppliance 

of articles in obligatory contexts and target-like use in obligatory contexts was 

reported at 95.46% as the most frequent coefficient. The area of disagreement 

between the researchers was the use of articles in non-obligatory contexts 

(26.53%) whereas there was a satisfactory agreement when it came to 

misformations. Overall, both raters agreed 2820 times out of 2984. A major 

drawback of Holsti’s method reported is the lack of ability to calculate the 

agreement by chance (Wang, 2011). Due to this weakness, we adopted a third 

index, Scott’s pi (π), which not only improves on simple percent agreement but 

also takes into consideration category values and accounts for chance agreement 

(Wang, 2011). Scott’s pi (π) was used to determine inter-rater reliability and its 

results were used to check rater bias and language background effects. 

 
Inter-rater reliability and language background effects 

The coding for the reliability sample included identification of all instances 

of correct suppliances, misformations, obligatory contexts and non-obligatory 

contexts in all 97 narratives. In as much as the raters worked independently in 

coding the samples, the researchers used Scott’s pi (π) for verification of the 

reliability and inter-rater agreement.  The equation for Scott’s pi is: 

 
Where:  Pr(a) = observed agreement between coders 

  Pr (e) = expected agreement between the coders 

 
To obtain coefficients of reliability for Scott’s pi scores, the raters compared 

each instance of agreement in each narrative for articles SOC and TLU categories. 

The results indicated consistency in inter-rater reliability. However, it was 

anticipated that issues would arise from the teen group since it was the only group 
that had a completely different perspective towards the narrative exercise. This 

group chose to name the characters rather than objectifying them as they would 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

139 

 
have appeared in the text. This necessitated revision of the rating scale to include 

naming since there were significant differences in how the raters chose to address 

the issue. The coefficient of reliability for all cases was 88.52% (Articles SOC) 

and 88.29% (Articles TLU).  Table 9 illustrates the inter-rater scores using Scott’s 

pi (π). 

 
Table 9. Scott’s pi (π) Inter-rater Reliability 
Items Children 

(%) 

Teenagers 

(%) 

First Year 

(%) 

Fourth Year 

(%) 

Overall (%) 

Articles SOC 82.19 77.89 85.76 91.31 88.52 

Articles TLU 83.46 75.10 84.43 91.25 88.29 

 
Apart from reaching the inter-rater reliability for raters, the need for 

determining patterns of disagreement was important with regards to the rater 

profile, i.e. NS and NNS. Out of 97 participants, it was noted that Swahili was the 

L1 for 83 participants and L2 for 14 participants, English was L2 and L3 

respectively. Rater 1’s L1 is Swahili and it may be inferred from the research on [-

ART] languages as to background effects on their rating unlike for rater 2, whose 

was L1 was English. Bias terms were measured for each of the raters despite the 

absence of an English L1 participant. The bias terms followed the SOC and TLU 

scores of each rater per participant where a total of 86 participant scores fell 

within the Z score range of -1.96 and +1.96 using a 95% confidence level. Only 

11 participants’ scores fell out of range. This indicates that disagreement effects 

were not significant as expected because the magnitude of bias was not 

substantive and both raters contributed to the bias. Where bias was exhibited, it 

was discovered that most of the cases were found in one particular group of 

participants. Table 10 illustrates the bias terms by participant. 

 
Table 10. Bias terms by participant 
SOC TLU 

Participant # 
SOC TLU 

≤ -1.96 ≤ -1.96 ≥ 1.96 ≥1.96 

   R2 32   

R1   R2 35*   

R1   R2 43*   

R1 R2 R1 R2 49*   

R1 R2  R2 51*   

   R2 56*   

R1 R2 R1  57*   

R1 R2 R1  58*   

R1    66*   

R1  R1 R2 67*   

 R2  R2 72   

Key: * teenage group 

 
Table 10 indicates that rater 1, as an NS of Swahili, was biased when 

participants supplied articles in the obligatory contexts (production) than rater 2 

who was more inclined towards the accuracy of the use of the articles 

(performance) by the participants. Using the notion of the directionality of 

severity even though bias, in this case, does not entail severity (Johnson & Lim, 

2009), it is noted that both raters’ biases were negative numbers and were 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

140 
 

clustered between -3.656 and -1.988. Although Johnson and Lim (2009) made use 

of a different analysis index from the one adopted in this study, their analysis 

claimed that positive numbers indicate harshness and negative numbers indicate 

leniency. This could be loosely interpreted that the raters had a similar inclination 

towards leniency and were consistent in their observations of the data. This 

observation supports the findings of the study by Kim (2009) that indicates the NS 

and NNS raters showing consistency. However, the results do not support studies 

(Chalhoub-Deville, 1995; Brown, 1995; Chalhoub-Deville & Wigglesworth, 

2005; Hadden, 1991) that noted significant differences in how NS and NNS raters 

behave, and with NNS and teachers being more severe in their assessments. One 

possible cause for the consistency found in this study could be the experience that 

both raters had with foreign language teaching. Moreover, the issue of NS and 

NNS is fluid in this study because there is a rater who happens to be an NS of an 

L1 that is shared by over 85.5% of the study participants as well as being an NNS 

with near-native fluency to the language of study. This raises the question of the 

application of intuitive knowledge by the raters. Despite the use of a rating scale, 

suppliance, and accuracy judgments, it became evident that some judgments were 

also made based on each rater’s intuition and perception of student intent. 

Inconsistent student use of capitalization, inconsistent use of the definite article, 

and spelling mistakes further complicated the rating process. Although Scott’s pi 

places the inter-rater reliability at an average of 88.52% (SOC) and 88.29% 

(TLU), subjective impressions from initial agreement analyses revealed that there 

may be patterns to the non-agreement (11.59%), with misformation and non-

obligatory context as frequent areas of non-agreement. Despite the perception of 

systematic non-agreement between raters, the disagreement was not statistically 

significant. Disagreement occurred primarily in narratives that used capitalization 

variably, which was perceived by one rater as naming (no article required), but by 

the other as misformation.  Because of this limited effect, we believe that rater 

language background effects were not significant. 

 
Conclusion      

This study was guided by three research questions; i) is there variability in 

the suppliance and accuracy of the English article acquisition among the EFL 

learners?; ii) to what extent will the raters agree in rating the article suppliance 

and accuracy? and; iii) is there an identifiable pattern to rater disagreement? If 

there is an identifiable pattern to rater disagreement, can an argument be made 

regarding the language background of the raters?  

Regarding the performance of the learners on the narrative task, variability 

was found to be significant among the four groups that participated in the study. 

Further analysis revealed that the results on the suppliance and accuracy of 

articles confirm that native-like performance for the more advanced participants 

has not been reached despite the increased time of instruction compared to other 

participants of the study (Zdorenko & Paradis, 2008; Masters, 1987; Ogawa, 

2008). Even though for 11 out of 18 of the advanced participants English was an 

L3, there is no indication of any substantial effect on the overall results. Higher 

proficiency in article suppliance and accuracy was found in the advanced 

participants which support findings by Jaensch (2008) and can be attributed to the 

increased time of instruction (mean years of learning = 13.95). A methodological 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

141 

 
choice was made to leave the Ø article out of the analysis and focus on the 

definite and indefinite articles according to the specifications of the rating scale. 

The issue of the-flooding was not an area of focus and where it occurred it was 

considered as a misformation. Evidence of fluctuation can be implied by the 

performance of the teens' group (mean years of learning = 9.14 years). Also, the 

findings are indicative of U-shaped learning and it can be assumed that the 

learners are at the stage of parameter setting (Zdorenko & Paradis, 2008). This 

particular group also exhibited the use of the distal demonstrative ‘that’ to 

substitute the referential function of the definite article. Similar sentiments are 

expressed by Kaku (2006) who found Japanese learners of English using 

demonstratives for specificity. In terms of the learner performance and the coding 

decisions between the raters, consistency in articles was relative and when it 

occurred, it was seemingly governed by the learners’ perception of the semantic 

function of the characters in the narratives and character-character interaction. In 

regards to how well the four groups of English language learners used articles, the 

study revealed there was a significant difference between the four groups in SOC 

and TLU measures. Follow-up discussion of the perception of student intent and 

exploration of disagreement between the raters discovered that there were 

systematic shifts in anaphoric use of articles in the narratives. This could be 

explained as an L1 effect in the learners. \ 

With regards to the preceding research questions on rater agreement, the 

researchers used inter-rater reliability and inter-rater agreement measures in what 

may be considered traditional SLA tests of learner ability to produce articles by 

measuring SOC and TLU scores. In using these tests, we find that it is 

constructive and it bridges language testing methods to SLA research. Through 

the combination of SOC and TLU measures, inter-rater agreement and inter-rater 

reliability and SOC and TLU methods employed, the findings of the study have 

revealed through two inter-rater agreement indices that there is a very high level 

of agreement whereas in one index there seems to be fair to slight level of 

agreement. Feinsten and Cicchetti (1990) confirm that there is a tendency of a low 

kappa statistic recorded with high agreement levels as we have found in this 

study. It is important to note that the study did not make use of final scores of the 

narratives as would in most IRR studies but rather the scores of the raters’ 

judgments of production and accuracy of English articles as interpreted in the 

narratives. This method contributes to the body of knowledge on rater agreement 

studies in that teasing apart the aspects of measurements may provide insight into 

levels of agreement. Furthermore, the analysis indicates that the language 

background of the raters does not influence agreement between them. The 

evidence of support is found in the bias terms as indicated in Table 10 which 
indicates consistency between the raters. It further signifies that the raters shared 

challenges in rating the same narratives of the participants. Additionally, it points 

out that experience in foreign language teaching had a role to play in how the 

raters viewed these same narratives even more so the language proficiency of the 

NNS rater. The study has proven that where studies involving NNS with above 

intermediate proficiency, the likelihood for them to rate at almost the same level 

of the NS is very high. Johnson and Lim (2009) hypothesize that NNS raters 

could rate performance assessments differently because they possess a language 

background from places with well-developed varieties of English thus causing 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

142 
 

them to overlook or accept features that are unacceptable in a standard dialect. 

This has not been the case in this study. Still, the major question also lies in how 

much of the rater’s intuitive knowledge of the language matter is being used, 

which cannot be measured or observed as part of the rating scale that has been 

agreed upon. A major conclusion of this study is that training of the rating scale 

and probably the experience of the raters minimizes the language background 

effects and other possible biases. However, it does not eliminate the possibility of 

rater focus on particular areas of rating that emanate from their intuitive 

knowledge and use of the language of assessment.  

This study acknowledges and addresses some methodological limitations 

faced in the analysis processes. First, the study employed labor-intensive 

procedures in the coding and analysis of the data. This intensity is evident in the 

rating scale, SOC and TLU measures, narrative method and the Holsti method. 

The SOC and TLU measures are not common methods in the collection of data 

for IRR studies but through this study, it has proven to be a means through which 

individuality and freedom of rater judgments can be achieved. Second and closely 

related to the first limitation is the design of the rating scale. The rating scale not 

only allows for individuality and freedom of the rater judgments but it can also 

allow for intuitive methods that rely mostly on the interpretation of the raters 

about the learner narratives. The Holsti method allowed the raters to revisit each 

instance they coded painstakingly and determine the level of agreement and 

disagreement. Both raters, however, had previous experience of using the SOC 

and TLU measures, therefore, limiting the training time of the adopted scale in the 

study. Third, the number of raters involved in the study does not strongly provide 

a basis for rater language background influence argument in comparison to most 

studies on rater language background effects. The study had only two raters of 

varying English language background, as a result, it only amplifies issues that 

could arise from rating systems of language tests that may have not been 

standardized; consider the SOC and TLU measures as well as the use of 

narratives. Methodological choices of this nature may sometimes permit 

unreliable conclusions where rating lacks a systematic procedure and as a result, it 

inadequately expresses the proficiency of a learner but it can also provide grounds 

for developing systematic procedures for analyzing learner compositions. Based 

on these three limitations, it is prudent to argue that generalizability of the results 

would require some amount of caution. 

In conclusion, this study suggests that the kappa coefficient may not be 

sufficient in expressing inter-rater agreement as also indicated in other studies 

(Flight & Julious, 2015; McHugh, 2012, Tang, et.al. 2015). It proposes the use of 

other indices that may support the results acquired through Cohen’s kappa. 

Evidence from the study also supports that training in the rating scale rubric 

(Johnson & Lim, 2009) is an important factor in the scoring of the assessments, 

however, the study also emphasizes the importance of the experience of the raters 

in foreign language teaching as an important factor in minimizing language 

background effects in cases where NS and NNS raters are used. Due to this 

observation, the study could not provide a concrete argument as there being any 

language background effects in the assessment of the narratives. 

 
LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

143 

 
References 

Barnwell, D. (1989). ‘Naïve’ native speakers and judgments of oral proficiency in 

Spanish. Language Testing, 6, 152–163. 

Bickerton, D. (1981) Roots of language. Ann Arbor, MI: Karoma.  

Brown, A. (1995). The effect of rater variables in the development of an 

occupation- specific language performance test. Language Testing, 12, 1–

15. 

Byrt, T., Bishop, J. & Carlin, J.B. (1993). Bias, prevalence and kappa. Journal of 

Epidemiology, 46(5): 423-429 

Caban, H. L. (2003). Rater group bias in the speaking assessment of four L1 

Japanese ESL students. Second Language Studies, 21, 1–44. 

Celce-Murcia, M. & Larsen-Freeman, D. (1999). The grammar book: An 

ESL/EFL teacher’s course (2nd Ed.), Boston: Heinle & Heinle Publishers 

Chalhoub-Deville, M. & Wigglesworth, G. (2005). Rater judgment and English 

language speaking proficiency. World Englishes, 24, 383–391. 

Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different 

tests and rater groups. Language Testing, 12, 16–33. 

Chierchia, G. (1998). Plurality of mass nouns and the notion of ‘semantic 

parameter’. In S. Rothstein (Ed.), Events and Grammar (pp 53-103). 

Kluwer: Dordrecht. 

Crisp, V. (2008). Exploring the nature of examiner thinking during the process of 

examination marking. Cambridge Journal of Education, 38, 247–264. 

Cumming, A. (1990). Expertise in evaluating second language compositions. 

Language Testing, 7, 31–51. 

Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating 

ESL/EFL writing tasks: A descriptive framework. Modern Language 

Journal, 86, 67–96. 

Dilin, L. & Gleason, J.L. (2002). Acquisition of the article the by non-native 

speakers of English: An analysis of four non-generic uses, Studies in Second 

Language Acquisition, 24(1), 1-26. 

Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater 

responses to an online training program for L2 writing assessment. 

Language Testing, 24, 37–64. 

Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual 

feedback to enhance rater training: Does it work? Language Assessment 

Quarterly, 2, 175–196. 

Fayer, J. M. & Krasinski, E. (1987). Native and nonnative judgements of 

intelligibility and irritation. Language Learning, 37, 313–326. 

Feinsten, A. R. & Chicchetti, D.V. (1990). High agreement but low kappa: The 
problems of two paradoxes, Journal of Clinical Epidemiology, 43, 543-548. 

Flight, L., & Julious, S. A. (2015). The disagreeable behavior of the kappa 

statistic. Pharmaceutical Statistics, 14(1), 74-78. 

https://doi.org/10.1002/pst.1659  

Galloway, V. B. (1980). Perceptions of the communicative efforts of American 

students of Spanish. Modern Language Journal, 64, 428–433. 

Hadden, B. L. (1991). Teacher and nonteacher perceptions of second-language 

communication. Language Learning, 41, 1–24. 

https://doi.org/10.1002/pst.1659


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

144 
 

Hawkins, R., Al-Eid, S., Almahboob, I., Athanasopoulos, P., Chaengchenkit, R., 

Hu, J., Rezai, M., Jaensch, C., Jeon, Y., Leung, Y-K.I., Matsunaga, K., 

Ortega, M., Sarko, G., Snape, N. & Velasco-Zarate, K. (2006) Accounting 

for English article interpretation by L2 speakers. In Foster-Cohen, S.H., 

Medved Krajnovic, M. and Mihaljevic Djigunovic, J. (eds) EUROSLA 

Yearbook, Volume 6, 7-25 

Holsti, O. R. (1969). Content analysis for the social sciences and humanities, 

reading. MA: Addison-Wesley. 

Huebner, T. (1985). System and variability in interlanguage syntax. Language 

Learning, 35, 141-163 

Huebner, T. (1983). A longitudinal analysis of the acquisition of English. Ann 

Arbor, MI: Karoma. 

Huot, B. A. (1993). The influence of holistic scoring procedures on reading and 

rating student essays. In M. M. Williamson & B. A. Huot (Eds.), Validating 

holistic scoring for writing assessment: Theoretical and empirical 

foundations (pp. 206–236). Cresskill, NJ: Hampton Press. 

Ionin, T., Ko, H.  & Wexler, K. (2004) Article semantics in L2 acquisition: The 

role of specificity. Language Acquisition, 12(1), 3-69 

Jaensch, C. (2008). L3 acquisition of articles in German by native Japanese 

speakers. In Proceedings of the 9th Generative Approaches to Second 

Language Acquisition Conference (GASLA 2007). Somerville, MA: 

Cascadilla Proceedings Project (Vol. 8189, No. 2009, p. L3). 

Johnson, J. S., & Lim, G. S. (2009). The influence of rater language background 

on writing performance assessment. Language Testing, 26, 485–505. 

Kaku, K. (2006). Second language learners’ use of English articles: A case of 

native speakers of Japanese. Cahiers Linguistiques d’Ottawa/Ottawa Papers 

in Linguistics, 34, 63-74. 

Kim, Y.-H. (2009). An investigation into native and non-native teachers’ 

judgments of oral English performance: A mixed methods approach. 

Language Testing, 26, 187–217. 

Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: 

How does it compare with face-to-face training?. Assessing Writing, 12, 26–

43. 

Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass 

communication: Assessment and reporting of intercoder reliability. Human 

Communication Research, 28, 587-604. 

Lu, C.F-C. (2001) The acquisition of English articles by Chinese learners, Second 

Language Studies, 20, 43-78. 

Lumley, T. (2005). Assessing second language writing: The rater’s perspective. 

Frankfurt, Germany: Lang. 

Lyons, C. (1999). Definiteness. Cambridge: Cambridge University Press. 

Master, P. A. (1987). A cross-linguistic interlanguage analysis of the acquisition 

of the English article system (Doctoral dissertation, UCLA). 

McHugh, M. L. (2012). Interrater reliability: The kappa statistic, Biochem Med 

(Zagreb), 22(3), 276-282. 

McNamara, T. (1996). Measuring second language performance. New York, NY: 

Addison Wesley Longman Limited. 


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

145 

 
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making 

behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), 

Performance testing, cognition and assessment: Selected papers from the 

15th Language Testing Research Colloquium (pp. 92–114). Cambridge, UK: 

Cambridge University Press. 

Murphy, S. (1997) Knowledge and production of English articles by advanced 

second language learners, Unpublished doctoral dissertation, University of 

Texas at Austin. 

Nickalls, R. (2013). Inter-rater reliability testing of article error tags: an 

argument for framework simplicity. Poster session presented at the Learner 

Corpus Research Conference, Bergen, Norway, Retrieved from 

https://lcr2013.w.uib.no/files/2013/09/Nickalls-poster.pdf  

Norris, J. & Ortega, L. (2003). Defining and Measuring SLA. In C. J. Doughty & 

M.H. Long (Eds.) The Handbook of Second Language Acquisition (pp 717 – 

760). https://doi.org/10.1002/9780470756492.ch21  

Ogawa, M. (2008) The acquisition of English articles by advanced EFL Japanese 

learners: Analysis based on noun types, Journal of Language and Culture 

Language and Information 3, 133-151 

Parrish, B. (1987) A new look at methodologies in the study of article acquisition 

for learners of ESL, Language Learning 37, 361-83 

Pica, T. (1983). Methods of morpheme quantification: Their effect on the 

interpretation of second language data. Studies in Second Language 

Acquisition, 6(1), 69-78. 

Sakyi, A. A. (2000, October). Validation of holistic scoring for ESL writing 

assessment: How raters evaluate. In Fairness and validation in language 

assessment: Selected papers from the 19th Language Testing Research 

Colloquium, Orlando, Florida (Vol. 9, p. 129). Cambridge University Press. 

Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language 

Testing, 25, 465–493. 

Stolarova, M., Wolf, C., Rinker, T., & Brielmann, A. (2014). How to assess and 

compare inter-rater reliability, agreement and correlation of ratings: An 

exemplary analysis of mother-father and parent-teacher expressive 

vocabulary rating pairs. Frontiers in psychology, 5, 509. 

Tang, W., Hu, J., Zhang, H., Wu, P., & He, H. (2015). Kappa coefficient: A 

popular measure of rater agreement. Shanghai Archives of Psychiatry, 27(1), 

62-67. https://dx.doi.org/10.11919%2Fj.issn.1002-0829.215010  

Tarone, E. (1985). Variability in interlanguage use: A study of style-shifting in 

morphology and syntax, Language Learning, 35, 373-404 

Trademan, J. (2002). The acquisition of English article system by native speakers 
of Spanish and Japanese: a cross-linguistic comparison (Unpublished PhD 

dissertation, University of New Mexico). 

Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In L. 

Hamp-Lyons (Ed.), Assessing second language writing in academic contexts 

(pp. 111–125). Norwood, NJ: Ablex. 

Wakabayashi, S. (1997). The acquisition of functional categories by learners of 

English (Unpublished doctoral dissertation, University of Cambridge). 

https://lcr2013.w.uib.no/files/2013/09/Nickalls-poster.pdf
https://doi.org/10.1002/9780470756492.ch21
https://dx.doi.org/10.11919%2Fj.issn.1002-0829.215010


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

146 
 

Wang, W. (2011).  A content analysis of reliability in advertising content analysis 

studies.  Electronic Theses and Dissertations, p.1375. 

http://dc.etsu.edu/etd/1375  

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater 

consistency in assessing oral interaction. Language Testing, 10, 305–335. 

Yamada, J. (1982). The use of the English articles among Japanese students. 

RELC Journal, 13(1), 50-63. 

Zdorenko, T. & Paradis, J. (2008). The acquisition of articles in child second 

language English: fluctuation, transfer or both?, Second Language Research, 

24(2), 227-250. 

 
http://dc.etsu.edu/etd/1375


LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 
 

147 

 
Appendix A 

A. Children’s Story Picture Sequence 

  
B. Teenager’s Story Picture Sequence 

 
LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, April 2021 

148 
 

C. Adults Story Picture Sequence