LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 LLT Journal: A Journal on Language and Language Teaching http://e-journal.usd.ac.id/index.php/LLT Sanata Dharma University, Yogyakarta, Indonesia 412 THE EFFECTS OF NARRATIVE AND ARGUMENTATIVE MODES ON ASSESSING LEARNERS’ WRITTEN PERFORMANCES BASED ON THE ANALYTIC RATING SCALE Rania Zribi and Chokri Smaoui Sfax University, Tunisia raniazribi@ymail.com; smaoui2002@yahoo.com correspondence: raniazribi@ymail.com DOI: 10.24071/llt.v24i1.2986 received 19 November 2020; accepted 29 October 2021 Abstract This study aims at investigating the effects of discourse modes on assessing EFL learners’ written performances. A total of fifty raters judged sixty essays (30 narratives and 30 argumentative writing modes) written by third-year English students from the Faculty of Letters and Humanities. Raters not only scored the compositions but also justified their scores’ assignments based on written explanations. Raters’ rating behaviors were diagnosed based on a variety of quantitative and qualitative tools. Essay scores were analyzed based on the statistical model FACETS to measure raters’ severity and internal consistency, task difficulty, and the scale functioning across writing modes. Qualitative data (gathered from interviews and report forms) were also analyzed in order to examine which aspects of writing were deemed more important than others across task types. The analysis revealed that the discourse mode was substantially an influential factor. The narrative task was more difficult than the argumentative one. Narrative essays were judged harsher than argumentative essays. Less consistent ratings could be detected from the narrative mode, compared to the argumentative one. Qualitative findings showed that the two writing modes were different in their qualitative judgments due to their different genre requirements and norms. Keywords: discourse modes, scoring, rating scale, FACETS, scores’ variability Introduction Academic writing is a crucial communicative skill in English as a first language (L1), second language (L2), and foreign language (EFL) teaching and learning instructions. It is a sophisticated “form of thinking” (Zinsser, 1988 p. vii), in which the writer has to perform different actions simultaneously, such as planning, organizing, writing, revising, editing, and publishing (Weigle, 2002 p.4) to produce coherent and accurate performance. The mastery of this complex skill is essential for university students, who are required to develop their writing abilities at this level through their cognitive process of not only constructing meaningful knowledge but also transmitting messages to their readers based on academic essays. LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 413 Apart from its complexity in language teaching, the writing skill also seems difficult to be tested by EFL teachers. In this respect, Bizzell (1987) focuses not only on the complex nature of the writing activity, but also on the extreme difficulty of assessing it, especially with the presence of human raters, whose great deal of subjectivity constituted one of the perennial problems related to direct writing assessments (p.583). While reviewing the literature, it has been noted that the same learners’ written performances were assessed subjectively by different raters despite the use of the same rating scale with its well-defined rating criteria, resulting in inconsistencies that would threaten not only reliability but also validity in the writing assessment context. Raters have different perceptions of what constitutes a good writing sample. What is appraised by one rater is downplayed by another. In justifying their scores’ assignments, raters may overlook some mistakes while others may magnify them in measuring students’ language skills. On that account, raters’ potential scores’ variability and divergent rating judgments can be due to various factors related mainly to raters, writing modes, rating scales, rating criteria…etc. Out of the myriad influential sources of scores’ variability, this paper attempts to focus on the task variable, because as Barkaoui (2008) claims, “task characteristics can also influence rater performance and reliability’’ (p.12). It is thus proposed, in this work, to provide a deep analysis of the effect of task types on raters’ quantitative and qualitative judgments. Our main intention is to investigate and the way raters assess learners’ written performances based on a well-defined analytic rating criteria. Our primary goal is to analyse the possible discrepancy in raters’ judgments of narrative and argumentative writing modes in the analytic rating scale, by taking into consideration not only their severity and internal consistency rates but also the difficulty estimates of the two writing task types and the scale functionality. This research also will focus on the writing aspects that attracted the attention of raters in evaluating the same test takers’ narrative and argumentative essays analytically. The current study addresses the following research questions: 1. What are the effects, if any, of narrative and argumentative tasks on raters’ severity and internal consistency based on the analytic rating scale? 2. To what extent do narrative and argumentative tasks vary in terms of their difficulty estimates? 3. Do writing modes influence the functionality of the analytic rating scale? 4. Do different task types affect the raters’ scoring behaviors and the aspects of writing they attend to, based on the analytic rating scale? Review of the literature Discourse mode, a task characteristic that could potentially influence the assessment of EFL learners’ writing performances, is of particular interest in the present study. A crucial issue pertaining to the evaluation of writing proficiency is scores variation among raters due to different variables, mainly the tasks variable. The latter should be controlled in testing writing skills to allow learners to generate their best performance and to ensure valid and reliable scores. In the realm of academic writing assessment and scores variation research, the effect of prompt types on raters’ scores has been profusely investigated by researchers and LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 414 specialists in the field (Engelhard et al., 1992; Kegley, 1986; Kuhlemeir, et al., 1995; Quellmalz, et al., 1982; Sachse 1984). Several empirical researches have pointed to the conceivable impacts of different task requirements and modes on raters’ scoring behaviors (Cumming et al. 2002; Weigle, 1999) and the reliability of their scores assignment (Tedick, 1990). In this context, Stifler (2002) maintains that “modes of writing, or rhetorical modes are patterns of organization aimed at achieving a particular effect in the reader” (p.1). This idea was encapsulated by White (1982) in his claim that “we know that assigned mode of discourse affects test score distribution in important ways. We do not know how to develop writing tests that will be fair to students who are more skilled in the modes not usually tested” (p.17). To stress the inter- relation of both social and educational settings in language communication, Weigle (2002) focused on the effect of writing tasks and contextual factors on tests scores. She claims that “any assessment takes place in a given social and cultural context and may not be generalizable outside that context” (p.60). In this regard, Oxford (1996) states that “when language learners are asked to tell their histories, they inevitably address contextual, situational, cultural factors as part of the story of their learning” (p.582). Thus, Context has emerged as a vital theme in the educational system. A possible source of scores variation examined in this paper after measuring EFL learners’ essays was the discourse mode facet (narrative vs. argumentative tasks). The chief aim of previous research is to compare the raters’ scores assignment to two or more writing modes to extract their points of similarities and differences in essays measurement. The research finding of Kegley’s (1986) study can be used to illustrate the considerable effects of discourse mode on the assessment of the writing competence. She perceived differences between the mean score of a narrative sample and marks for descriptive, expository, and persuasive samples. The narrative essays received the highest marks while the persuasive essays received the lowest marks (p.147). In one of the studies investigating the notable effect of discourse mode on writing scores assignment task, Engelhard et al. (1992) proclaim that “narrative writing tasks received the highest ratings, with descriptive writing tasks receiving the next highest rating, and expository writing tasks receiving the lowest ratings” (p.329). In examining raters’ scores to two different discourse modes, the findings of Carrell’s (1995) study were condensed to denote higher holistic scores assigned to the narrative essays than to the argumentative essays produced by the same writers (p.175). In contrast, Quellmalz et al. (1980) obtained nearly opposite results in examining the relationship between two discourse modes and raters’ scores to learners’ compositions. They found that scores assigned to narrative essays were lower than those given to expository essays based on a five-point holistic rating rubric. Raters’ variability can be due to their tendency to rate narrative mode of discourse in a stringent way or to the examinees’ lack of knowledge or to their curricula requirements (p.13). Moreover, a strong correlation can be detected in the scores assigned to two essays produced in the same mode of discourse. However, that was not the case with the scores awarded to two essays in different discourse modes (p.13). In another study conducted to diagnose the effect of discourse modes on raters’ testing or writing quality, Quellmalz et al. (1982) concluded that “levels of performance vary on tasks presenting different writing LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 415 purposes” (p.255). Hence, this divergence in scoring the two different discourse modes could be attributed to task requirements, as each task requires different writing skills, leading to construct-relevant variance, which causes aberrations in raters’ score assignment tasks. Method Research design overview This study adopted a cross-sectional design to gather sufficient data from the examinees and their raters at a single point in time. Its chief aim is to analyse and interpret English raters’ evaluative behaviors and scores’ assignments when testing EFL learners’ writing responses to two different narrative and argumentative discourse modes. Hence, both teachers and students took part in this empirical research. A comparative pattern was also incorporated in this study to extract the differences and similarities in the scores and judgments assigned by raters to EFL test takers’ writing samples on two different writing modes based on the analytic rating scale. To advocate the efficiency of the comparative design in analysing the study outcomes in the language testing field, Collier (1993) argued that “comparison is a fundamental tool of analysis. It sharpens our power of description, and plays a central role in concept-formation by bringing into focus suggestive similarities and contrasts among cases” (p.105). Participants A total sample of thirty EFL learners voluntarily took part in this study. They are, a representative sample of a large population, enrolled in the third year English class level. These students were mostly females with a mean age of 22. They were under-graduate third-year English students, who have been specialized in the target foreign language for three years at the tertiary level and whose proficiency levels vary. All the test takers were non-native speakers of English and students in the English department at the xxx university. In addition, a panel of fifty writing teachers of English as a foreign language participated in this phase. They represented a mixed sample of male and female raters with an average age of 45 and belonged to different L1 backgrounds. Their first language is Arabic while English is their dominant work language in tertiary education in different Tunisian universities; they are specialized in teaching English as a Foreign language to EFL under-graduate learners. At the time of my data collection process, all third year learners were attending their English classes and lessons. From the third-year class, I selected randomly female and male students to sit for two separate one-hour task-based writing performance tests on two different testing occasions to respond silently to two different discourse modes, viz. narrative and argumentative prompts. Thus, each test taker produced two writing samples, to come up finally with a total of sixty essays. In the first task, each test taker is required to write an essay in which he narrates the way he has helped his family to solve a family problem. On the other hand, in the argumentative task, candidates were asked to provide their arguments to convince the reader about the assets and drawbacks of using technology in our society. The next step consisted in collecting the examinees’ writing performances. To control the effects of such variables as handwriting, these samples were typed, LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 416 without changing or removing the original mistakes. As it was neither a part of the learning objectives of the writing course at the third year level nor mentioned in the analytic rating rubric that raters relied on, handwriting was not among the writing aspects to be tested in the data collection phase. The names of the students, who composed the two essays were also removed and replaced with just numbers in order to minimize potential bias rates. The sixty narrative and argumentative essays (thirty essays in each task) were sent to fifty raters to judge and score their quality based on the same rating instruments and procedures. Procedures Test takers were instructed to generate an essay after responding to each task requirements on two different testing occasions. First, each candidate produced an essay after responding to the narrative prompt, then, within a one-week period, he responded to the argumentative writing task by generating an argumentative sample. The time allowed for each essay production was one hour to enable students to understand the topic, analyze it and generate a coherent writing sample in an authentic testing context. The present study used two different essay prompts, which vary in terms of their characteristics, notably content, structure, and wording. They are designed to evaluate EFL examinees’ abilities to perform coherent and well-structured academic writing samples. Written production was prompted via computer and then sent to raters to judge students’ writing proficiency in the two discourse modes based on the same analytic rating scale. In this respect, the ESL Composition Profile designed by Jacob et al. (1981), was applied in this study to test written productions. This analytic scale was originally constructed for large-assessment purposes to test multiple composition samples of English as a second language. It comprises five different criteria, namely content, organization, vocabulary, language use, and mechanics (See Appendix A). Since this study has embarked on examining raters’ scoring patterns of EFL learners’ writing performances in both narrative and argumentative discourse modes, a mixed-methods triangulation design of both quantitative and qualitative approaches was applied to gather and report data about the raters’ decision-making while rating under-graduated learners’ essays. Quantitative data were extracted from different procedures. Analytic score report forms were employed to come across raters’ scores assignment and their decision-making process after assessing EFL learners’ writing compositions (See Appendix B). These analytic scores awarded by raters to the same test takers’ narrative and argumentative essays were analysed based on two statistical programs: SPSS and FACETS (version 3.80.0). The latter permits researchers to add as many facets as they need, such as raters, rater groups, tasks, students, rating scale, rating criteria, and so on depending on the purpose of each study. In this vein, Schaefer (2008) highlights the prominent value of this model by stating that “it has shown great promise in the area of performance assessment and rating scale validation because it can analyze sources of variation in test scores besides item difficulty or person ability” (p.466). Prior to the analysis and interpretation of raters’ judgments, the facets used in this study were coded. The following figure presents the relevant facets related to this study in the data collection phase. LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 417 A profile questionnaire about raters’ qualifications, and personal and professional data was also adopted (See Appendix C). These statistical procedures are used in order to examine the effects of evaluating different narrative and argumentative writing modes on raters’ severity and internal consistency, task difficulty, and rating scale’s performance. Furthermore, report writing forms were conducted to gather qualitative data about raters’ judgments of EFL test takers’ writing responses to two different narrative and argumentative prompts. Raters’ reading and assessing strategies were thus elicited based on explaining their rating patterns during the evaluation process. Findings and Discussion Analyzing raters’ quantitative judgments across tasks Both FACETS and SPSS statistical outcomes across task types were reported in this section to analyse raters’ scoring behaviors based on their analytic marks assigned to the same candidates’ narrative and argumentative essays. Rater severity FACETS analysis revealed clear differences in raters’ severity levels after assessing narrative and argumentative performance analytically. Measuring rater severity/ leniency levels on a logit scale, centred at 0, spanned 4.09 logits, from the most lenient rater located at -2.03 to the most severe rater located at 2.06 for Third-year English level Relevant facets related to the present study Test takers (n=30) Rater groups (n=50) Discourse modes Rating scale Rating categories University teachers of English Narrative Argumentative Analytic 1.Content 2.Organization 3.Vocabulary 4.Language use 5. Mechanics Figure 1: A schema of the relevant facets of assessment to this study LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 418 the narrative task and a 3.32 logit spread, with the most lenient rater at -2.30 to the harshest rater at 1.02 for the argumentative task. Comparing the range of raters’ severities across task types, the table shows raters’ variations. It is clear from table 1 that raters differ in their severity estimates. They graded the narrative essays more severely than the argumentative essays (A span of 4.09 for the narrative task is smaller than the 3.32-logit spread for the argumentative task). This may be because of the judges’ tendency to mark narrative essays more stringently (Quellmalz et al. 1982 p.13) or due to their perceptions of the writing task difficulty and their attempts to adjust scoring behaviors accordingly. These results were at variance with Engelhard et al’. (1991) study, which showed that “writing tasks that require more personal responses (direct and imagined experiences) tend to elicit essays that receive higher ratings than writing tasks that require impersonal or outside knowledge” (p.19). To test the significance of these raters’ different severity levels across tasks, FACETS output generated three indices, namely the separation statistics, chi- square with its p-value and the reliability of separation. Table 1: Summary of Rater Measurement Report by Discourse Modes (based on the analytic scoring procedure) Narrative Mode Argumentative Mode Rater Severity M (Model SE) .13 .13 SD (Model SE) .00 .01 Min -2.03 -2.30 Max 2.06 1.02 Infit M 1.0 1.0 SD .25 .31 Outfit M 1.01 1.0 SD .25 .33 Separation Statistics Separation Ratio (G) 8.75 7.85 Separation Index (H) 12.00 10.80 Reliability of Separation .99 .97 Fixed chi-square statistics 3519.0 2868.8 df. 49 49 Significance .00 .00 Inter-rater agreement opportunities 90000 90000 Exact agreement % 37.8% 34044 38.4% 34516 Expected agreement % 35.9% 32305.2 36.8% 33129.3 Analyzing raters’ analytic scoring decisions across the two tasks was based on FACETS outcomes as illustrated in the table above. The item separation ratios (G) were 8.75 for the narrative task and 7.85 for the argumentative task, indicating that the variance among scorers was approximately nine times higher than the error of estimates, especially for the narrative essays, thus suggesting that graders were not equally severe. The prompt separation index (H) was 12.00 for the LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 419 narrative essays and 10.80 for the argumentative essays, indicating that raters can be divided into about twelve severity levels in the narrative task and eleven levels in the argumentative task. Separation statistics, separating raters into more distinct severity levels for the narrative prompt (twelve levels) than for the argumentative prompt (eleven strata of severity), were fairly reliable due to high reliability of separation indices of .99 for the narrative task and .97 for the argumentative task. Thus, raters showed significant notable differences in the levels of severity they exercised for the two tasks, with a fixed chi-square value of 2868.8 for the argumentative task and 3519 for the narrative task (degree of freedom = 49) and a significant p-value at .00 (p ˂ .005). The null hypothesis that all scorers were equally harsh in their scores’ assignment to the candidates’ narrative and argumentative writings must be rejected. Measuring inter-rater reliability rates in assigning marks to the students’ narrative and argumentative essays was based on inter-rater agreement statistics. As table 1 demonstrated, out of 90000 possible opportunities for agreement, the numbers of exact agreements between raters were 34044 (37.8%) for the narrative essays and 34516 (38.4%) for the argumentative essays, while the expected ones were 32305.2 (35.9%) and 33129.3 (36.8%) for the narrative and argumentative writings respectively. The observed exact agreements (37.8% and 38.4%) were higher than the expected percentages (35.9%) and 36.8%). This explained the fact that raters did not judge their test takers’ performance in an independent way. Task types remain an influential factor in assessing learners’ writing skills in different writing modes. Rater internal consistency A more detailed analysis of raters’ internal consistency in judging examinees’ narrative and argumentative performance is based on fit statistics in the rater facet. A preferable infit mean-square value of 1.00 was perceived in the two prompts, suggesting intra-rater agreement between raters in assessing both tasks based on the analytic rating scale. They not only employed the analytic rating scale consistently but also maintained their severity levels across the two tasks. Their internal consistency in measuring learners’ writing modes could be explained by the fact that their scores assignment fitted perfectly the Rasch model predictions. Little variation however can be detected in the outfit mean-square values of 1.01 for the narrative task and 1.0 for the argumentative task. To further investigate raters’ internal consistency across discourse modes, the same three- class fit pattern of overfit, acceptable fit, and misfit was analysed to differentiate between raters in terms of their intra-rater agreement rates in measuring examinees’ samples. The following table exhibits the frequencies of raters’ consistency in the analytic assessment of the same set of narrative and argumentative essays. LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 420 Table 2: Frequencies of Rater Fit Statistics across Discourse Modes (based on the analytic scoring procedure) Narrative Mode Argumentative Mode Fit Range Infit MS Outfit MS Infit MS Outfit MS Overfit: Fit ˂ 0.70 3 (6%) 5 (10%) 8 (16%) 9 (18%) Acceptable Fit: 0.70 ˂ Fit ˂1.30 36 (72%) 32 (64%) 40 (80%) 39 (78%) Misfit : ˃ 1.30 11 (22%) 13 (26%) 2 (4%) 2 (4%) Out of the fifty raters, thirty-six (72%) exhibited acceptable infit estimates in measuring the test takers’ narrative performance, while forty (80%) displayed acceptable consistency rates in testing the argumentative writings based on FACETS output. Raters thus showed slightly higher intra-rater agreements in rating the argumentative tasks as compared to the narrative tasks. There were more overfitting raters (n= 8, 16%) in judging the argumentative essays compared to the narratives (n= 3 representing 6%), indicating little variation between scores’ assignment process to the two tasks and the FACETS expected scores. More misfitting raters (n= 11) appeared in scoring the narrative essays (22%) than in marking the argumentative essays (n= 2 representing 4%), suggesting much variability in the marks awarded to the narrative samples. Misfitting raters, whose different rating behaviors did not fit the model, threatened scores validity in the testing field. A small number of misfitting raters across tasks appeared in the analytic ratings (4% for the argumentative task and 22% for the narrative task). Based on these outcomes, raters were more consistent than the model predicted in scoring the argumentative essays, compared to the narrative essays. This can be attributed to the open-ended personal nature of narrative essays, which are difficult for raters to judge consistently, leading to such unwanted scores’ variations and inconsistent rating behaviors. The null hypothesis stating that raters across two distinct discourse modes showed the same severity and internal consistency levels in scoring the same test takers’ writings based on the analytic rating scale must be rejected. Prompts difficulty After analysing the rater facet in terms of severity measures and internal consistency estimates, it is crucial to focus on the task facet, as one of the variables in the current study by taking into account both the difficulty estimate parameter and fit statistical indices for each task. The former was applied to measure the difficulty levels of the two tasks, while the latter was used to test the consistency of measuring these tasks difficulty rates. Task average difficulty is set at 0 logit by convention. Table 3 illustrates prompt difficulty measures for both discourse modes resulting from the analytic ratings. LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 421 Table 3: Prompt difficulty estimates (n=2) (based on the analytic scoring procedure) Observed Average Fair Measure Average Measure Model SE Infit MS Outfit MS Narrative 2.56 2.58 .14 .02 1.04 1.05 Argumentative 2.68 2.71 -.14 .02 .96 .96 Mean (n= 2) 2.62 2.65 .00 .02 1.00 1.00 SD .09 .09 .20 .00 .05 .06 RMSE .02 Adj (True) S.D. 20 Separation 11.36 Reliability .99 Fixed (all same) chi-square: 130.0 d.f.: 1 significance (probability): .00 As can be drawn from table 3, the fair measure average for the narrative task (2.58) was less than the fair measure average for the argumentative (2.71) on the four-point analytic scale, which demonstrated that the narrative task was more difficult, as compared to the argumentative. To measure the underlying difficulty of the two prompts, a logit difficulty of .14 (SE = .02) for the narrative task was higher than the difficulty span of the argumentative task (-.14 with SE = .02), which indicated that it was more difficult to get a high score on the narrative task than on the argumentative one in grading essays based on the analytic rating rubric. The narrative task appears to be more difficult, compared to the argumentative task in the analytic rating scale. This can be explained by the fact that test takers are expected to perform their stories by narrating past experiences with special attention to correct language and rhetorical aspects of language to form coherent and fluent narrative flow. This personal open aspect of narrative flow reflects its difficulty for students and raters alike. To test the significance of the different levels of difficulty between the narrative and argumentative prompts, the fixed chi-square test with its p-value were underlined. The fixed chi-square yielded a value of 130.0 with 1 degree of freedom and a significant p-value (= .00), which rejects the null hypothesis that the two tasks are equal in difficulty based on the analytic scorings. Analytic scale functioning From FACETS output, test takers’ ability measures can be extracted to examine the functionality of the analytic rating scale with its five categories across discourse modes. The adequacy of the analytic rating scale with its five rating categories can be measured based on the threshold (step) calibration statistics generated from FACETS. For instance, concerning the content criterion, a test taker whose ability estimates was -2.11 for the narrative mode and -2.36 for the argumentative mode had a probability of 50% to be scored as either 2 or 3. Testing the scale functioning is also based on the way the category thresholds were ordered. According to table 4, the content category thresholds for example were ordered in an ascending order from -2.11 to 2.13 for the narrative mode and from -2.36 to 2.72 for the argumentative mode in the four score levels. The thresholds measures in the five rating categories increased monotonically as the score levels advanced. The distance between the thresholds was also adequate in the five rating categories, as they advanced by at least 1.4 logits from one score level to another, but did not exceed 5 logits. This indicates that the analytic rating categories were not only ordered but also functioned as expected by the model as LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 422 the thresholds advanced monotonically with the analytic scale levels. This stressed the functioning of the analytic scale. Table 4: Threshold Estimates for the Analytic Scoring Categories across Tasktypes Aspects of writing attended to across task types The table below presents the percentages of referring to each rating category in the analytic report forms after assessing learners’ narrative and argumentative essays. It shows that overall, raters explained their scores based not only on each rating category of the analytic scale but also on other writing aspects not mentioned in the scale. The highest percentages pertained to both content (28.67% for narrative essays and 27.82% for argumentative essays) and organization (about 22% for narrative essays and 28% for argumentative essays) across task types, while the lowest concerned mechanics (about 6% for narrative essays and 5.80% for argumentative essays) and vocabulary (about 12% for narrative essays and 10% for argumentative essays) criteria. Other writing aspects were more reported in the narrative mode (15.32%) than in the argumentative mode (12.40%). Additionally, raters reached approximately the same percentage (16% and 15.64%) in using the language use aspect across the two writing modes. Based on the frequency of raters’ comments across the five rating criteria to the narrative and argumentative essays, we can deduce that both writing modes reported the five rating criteria together with other writing aspects based on the same order of importance but with slight differences in terms of their percentages. Hence, content, organization, language use, other aspects, vocabulary and mechanics were mentioned in both tasks with some differences in frequency ranges. Table 5: Frequencies for Aspects of Writing in Report Forms across Task Types Content Org Vocab Lg Use Mechanics Other Asp Narrative Mode 28.67 21.91 11.91 16 6.17 15.32 Argumentative Mode 27.82 27.93 10.41 15.64 5.81 12.40 The Narrative Mode Score Levels Content Org Vocab Lg Use Mech 1 None None None None None 2 -2.11 -1.79 -2.37 -2.29 -1.91 3 -.02 -.38 -.11 .09 -.09 4 2.13 2.17 2.48 2.21 2.00 The Argumentative Mode Score Levels Content Org Vocab Lg Use Mech 1 None None None None None 2 -2.36 -1.86 -2.52 -2.32 -1.89 3 -.36 -.51 -.28 -.15 -.12 4 2.72 2.38 2.80 2.47 2.01 LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 423 To scrutinize whether the differences in percentages of decision-making strategies and aspects of writing reported in the analytic report forms across task types were significant, I conducted a Wilcoxon Signed-Ranks test for each rating category. The table below portrays the statistical significance of the difference in using the six rating criteria across narrative and argumentative tasks based on the non-parametric Wilcoxon signed-ranks test. The Wilcoxon test indicated that only the difference for organization was significant as p-value is below the threshold level of 0.05 (p = .000). The p-values of each of the other five rating categories, being .748, .169, .922, 724, and .009 are all above the threshold of 0.05. Therefore, it was clear that the use of the five scoring criteria was not significantly different across writing prompts. Different task types did not significantly influence the raters’ score assignment and decision-making tasks. Table 6: Test Statistics across narrative and argumentative tasks The above quantitative data analysis explicates the effect of narrative and argumentative writing modes on raters’ scoring patterns based on the analytic rating rubric. The statistical FACETS outcomes, highlighting the impact of discourse modes on the scores’ assignment task, showed raters’ different decision making processes across task types, which will be further examined in the qualitative part. This section clarifies the aspects of writing that raters attended to across the narrative and argumentative tasks, by analyzing report form explanations, associated with the writing modes variable. Analyzing raters’ qualitative judgments across tasks During the assessment process, the fifty scorers were required to explain their assigned marks to the sixty students’ performance by writing their remarks in the analytic report forms. Raters’ comments were then classified across narrative and argumentative writing modes. This classification helps us to compare raters’ explanations across tasks, by examining which aspects of language may attract the raters’ attention in evaluating narrative and argumentative writings. This qualitative analysis thus investigates the nature of the raters’ scoring behaviors across tasks. A scrutiny of raters’ feedback suggested that raters’ score explanations to narrative performances outnumbered argumentative essays based on the four-point analytic rating scale with its five different rating criteria. This Test Statisticsa Content Org Vocab Lg Use Mechanics Other Aspects Mann- Whitne y U 457440,000 402240,000 448800,000 459840,000 458400,000 436320,0 00 Wilcox on W 918720,000 863520,000 910080,000 921120,000 919680,000 897600,0 00 Z -,322 -5,568 -1,376 -,098 -,353 -2,611 Asymp. Sig. (2- tailed) ,748 ,000 ,169 ,922 ,724 ,009 a. Grouping Variable: Task Type LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 424 can be due to the difficulty of the narrative task and to the test takers’ inability to generate coherent and accurate narrative performances. In terms of content, judges were aware of genre stipulations and topic requirements across tasks. Although clear explanations, examples and details enhance ideas elaboration and arguments development, different task-specific functional elements affected raters’ judgments of both tasks. Judges referred to the problem, the solution and the consequence in measuring narrative essays, by commenting on chronological story events, its plot and climax. Raters, however, pointed to issue, evidence, claim, counter argument and support in argumentative essays, by focusing on the development of the thesis and the anti-thesis and parallelism between both parts of argumentation. In terms of focus, raters directed their attention to the use of quotations in the narrative mode and references in the argumentative mode. It seems that content affects not only coherence but also organization in narrative prompts. In terms of organization, raters provided comments on the conventional five- paragraph essay structure format with respect to both task types. Despite the importance of paragraphing in the two modes, argumentative essays were evaluated based mainly on the specific components of an introduction. Raters valued the presence of motivator, background information, thesis statement and blueprint in producing a well-structured argumentative introduction. They even focused on topic sentences in each controlling paragraph. They extended their comments to focus on recommendations in the conclusion. For the narrative tasks, however, raters pointed to the effect of fragmented ideas on developing the narrative flow of events, leading to unbalanced narrative performance, which in its turn resulted in an incoherent story. Both connection of ideas and transition between the different essay parts are pertinent in the narrative and argumentative productions. These divergences can be attributed to the different narrative and argumentative structures and organizational components and essays’ format. Some noticeable vocabulary differences between task types appeared in raters’ analytic scores explanations. While raters commented on vocabulary sophistication and variation across the two tasks, they made reference to the test takers’ word and form choices, especially verbs, adjectives, and adverbs to express their ideas clearly in narrative essays. Such comments were related to word placement, comparative and plural forms to formulate well-structured argumentative essays. Narrative tasks are associated with the informal register while argumentative tasks are related to the formal register. Differences in the way language use is valued by raters across tasks are also examined. The use of accurate language was commented on in the two tasks. Sentence constructions were a prevailing aspect in foreign language productions due to their complexity and forms variations. In response to narrative tasks, scorers referred to clear structures and expressions, similes, comparative forms, and parallel sentences, whereas in argumentative essays, they focused on word order and placement, and accurate well-formed sentences that reflect clear arguments. Conjunctions, long sentences, run-ons, fragments, prepositions, articles, pronouns and active voice can be found in both genres. A colloquial informal style was attributed to narrative essays, which was not the case with argumentative essays. More lexical and language mistakes appeared in the narrative task, compared to the argumentative task. Narrative essays are normally written in the simple past as test takers narrate their past experiences or LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 425 fictionalize stories, while argumentative tasks should be performed in the simple present tense. This seems to be due to the specific-task based assessment task. Raters may have some stylistic, syntactic, and lexical preferences associated with each writing mode. Raters also appeared to focus on mechanics in measuring examinees’ narrative and argumentative performances. Based on the analytic report forms, the majority of raters’ comments were negative, indicating serious problems in capitalization, punctuation, linkers and spelling in both tasks. These serious mistakes led to awkward structures, unclear sentence boundaries and incoherent text, which hindered the quality of the essay. After dealing with all the rating categories within the analytic rating scale, we noticed more comments related to writing aspects other than those mentioned in the rubric; content, organization, vocabulary, language use and mechanics. This can be explained by the fact that raters did not stick to the analytic rating scale during their measurement process of the sixty narrative and argumentative essays. This can be traced back to the possible specific task features that may attract raters’ attention. In this respect, raters may focus on some narrative or argumentative aspects over others. In assessing examinees’ narrative writing abilities, raters started by expressing their overall impressions of the whole piece of writing, by taking into account the originality of students’ narrative academic essays. In judging argumentative essays however, raters directed their attention to overall structure and relevance rather than to its original arguments. Narrative writings received more score explanations in grammatical and stylistic writing aspects, compared to argumentative essays. Raters commented on test takers’ use of contractions, modals, pronouns and verbs in their narrative performance. They even pointed to tense problems as some test takers used the simple present in narrating events. Raters were even more interested in perceiving stylistic features in both tasks. They referred to oral-like and colloquial narrative style of writing, which was not the case for formal argumentative essays. Argumentative style could be hampered by plagiarism. Judges also pointed to the effects of language interference and translation on both narrative and argumentative writings. Thus, differences pertaining to the above mentioned categories would appear to call for separate rating scales, related to task-specific feature and raters’ focus, which had been the most frequently investigated in foreign writing assessment. Raters’ cognitive processes and rating behaviors in narrative and argumentative tasks were different depending on their scoring approaches and their treatment of the rating criteria and other relevant writing aspects. These qualitative outcomes were in line with Quellmalz et al. (1982), who stated that “the different subskills included in the scoring rubric seem definitely to interact with discourse mode and, at the same time, to varying degrees are independent sources of variation in student writing performance” (p.20). Conclusion This mixed-method research examined the effects of discourse modes (narrative and argumentative essays) on assessing EFL learners’ writing skills. The narrative essays were rated significantly harsher than the argumentative essays. In terms of raters’ internal consistency across task types, the narrative writings led to a higher proportion of judges with fit statistics within the misfit LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 426 range, whereas the argumentative writings resulted in a higher proportion of scorers with overfit. Argumentative ratings were thus more consistent than narratives. The statistical program FACETS revealed that the narrative prompt was more difficult than the argumentative prompt. The qualitative outcomes were also complementary. Raters attended to different writing aspects across task types. Hence, the qualitative assessments were divergent across task types, indicating that raters held different conceptions of what constitutes a good writing. The major findings of the current study can direct our attention to different implications. These immutable differences across tasks may require the use of task-based rubrics, related to the specificities and characteristics of each writing genre, one for the narrative and one for the argumentative. It is also important to highlight the necessity for test developers to use different task types in any EFL writing assessment context in order to gather various writing samples, which represent each test takers’ writing ability and thus ensuring valid results “… by giving a broader basis for making generalizations about a student’s writing ability” (Read 1991 p.87). The analytic scale is also recommended as it might be useful for diagnostic and placement aims in high-stake writing assessments to ensure valid and reliable outcomes. As the present study shows, raters varied in their severity and consistency levels in their scores’ assignment across tasks. One strategy to improve this limitation is to apply the multi-faceted Rasch measurement model (FACETS) to adjust raters’ marks and enhance raters’ inter and intra-rater reliability in judging different task types. This statistical program, as Prieto and Nieto (2014) claim, allows the “analysis of the actions of different raters on different tasks, and enables us to determine, in part, whether the scoring categories appearing on rubrics must be adjusted or changed in order to obtain more consistent or valid scores” (p.386). References Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating processes and outcomes. Unpublished doctoral dissertation. University of Toronto, Canada. Bizzell, P. (1987). Review: What can we know, what must we do, what may we hope: writing assessment. College English, 49(5), 575-584. Carrell, P., L. (1995). The effect of writers’ personalities and raters’ personalities on the holistic evaluation of writing. Assessing Writing, 2(2), 153-l90. Collier, D. (1993). The comparative method. In A. W. Finifter (Ed.). Political Science: The State of the Discipline 2. American Political Science Association. Cumming, A., Kantor, R. & Powers, D., E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86(1), 67-96. Engelhard Jr, G., Gordon, B., & Gabrielson, S. (1991). Writing Tasks and the quality of student writing: Evidence from a statewide assessment of writing. Paper presented at the Annual Meeting of the American Educational Research Association. Engelhard, G., Gordon, B., & Gabrielson, S. (1992). The influences of mode of discourse, experiential demand, and gender on the quality of student writing. Research in the Teaching of English, 26, 315-336. LLT Journal, e-ISSN 2579-9533, p-ISSN 1410-7201, Vol. 24, No. 1, October 2021 427 Huot, B. (1990). The literature of direct writing assessment: major concerns and prevailing trends. Review of Educational Research Summer, 60(2), 237-263. Jacob, H., Zinkgraf, S., Wormuth, D., Hartfiel, V. F. & Hughey, J. (1981). Testing EFL cmposition: A practical approach. rowley. MassNewbury House. Kegley, P., H. (1986). The effect of mode discourse on student writing performance: Implications for policy. Educational Evaluation and Policy Analysis, 8(2), 147-154. Kuhlemeier, H., van den Bergh, H., & Wijnstra, J. (1995). Multilevel factor analysis applied to national assessment data. Paper presented at the Annual meeting of the American Educational Research Association, San Francisco. Oxford, R., (1996). When emotion meets (meta) cognition in language learning histories. The Teaching of Culture and Language in the Second Language Classroom, 581-594. Quellmalz, E., Capell, F, & Chou, C., P. (1982). Effects of discourse and response mode on the measurement of writing competence. Journal of Educational Measurement, 19, 241-258. Quellmalz, E., Capell, F., J., & Chou, C., P. (1980). Defining writing: Effects of discourse and response mode. CSE Report No.132. University of California. Sachse, P. (1984). Writing assessment in Texas: Practices and problems. Educational Measurement: Issues and Practice, 3, 21-23. Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25, 465-493. Stifler, B. (2002). Rhetorical modes. Tedick, D., J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9, 123-143. Veal, L. R. & Tillman, M. (1971). Mode of discourse variation in evaluation of children’s writing. Research in the Teaching of English, 5(1), 37-45. Weigle, S., C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), pp. 145-178. Weigle, S., C. (2002). Assessing writing. Cambridge University Press. Weir, C., J. (1990). Communicative language testing. UK: Prentice Hall. White, E., M. (1982). Some issues in the testing of writing. Notes from the National Testing. Network in Writing. New York: Instructional Resource Center of CUNY. Yunick, S. (1997). Genres, registers and sociolinguistics. World Englishes, 6(3), pp. 321-336. Zinsser, W., K. (1988). Writing to learn. New York: Harper & Row.