JOLLT Journal of Languages and Language Teaching
https://e-journal.undikma.ac.id/index.php/jollt/index
Email: jollt@undikma.ac.id
DOI: https://doi.org/10.33394/jollt.v%vi%i.7481
April 2023. Vol. 11, No. 2, pp. 225-237
p-ISSN: 2338-0810; e-ISSN: 2621-1378

EVALUATING THE QUALITY OF A TEACHER'S MADE TEST AGAINST FIVE PRINCIPLES OF LANGUAGE ASSESSMENT

1*Dedi Sumarsono, 1Moh. Arsyad Arrafii, 1Imansyah
1English Language Education, Mandalika University of Education, Indonesia
*Corresponding Author Email: dedisumarsono@undikma.ac.id

Article History: Received: March 2023; Revised: March 2023; Published: April 2023

Abstract
Classroom assessment plays a dual role, serving both summative and formative functions that aim to gather evidence of, evaluate, and improve student learning. To ensure the accuracy and authenticity of assessment evidence, effective assessment instruments, such as tests, are essential for obtaining valid and reliable evidence of student learning. However, teachers may lack the theoretical grounding needed to design and develop assessment instruments that align with sound language assessment principles. Consequently, this study evaluates a classroom-based assessment instrument, specifically a language test, against five fundamental principles of language assessment. Using an evaluative research approach, a teacher-developed assessment instrument was evaluated and rated against these principles. Data were collected through documentation, namely the analysis of a language test developed by a teacher. The study reveals that the teacher-made test generally meets all aspects of the principles, but some aspects require further attention. Accordingly, the study provides insights and suggestions for improvement to address these areas of concern.

Keywords: Classroom Assessment; Test; Principles of Language Assessment; Assessment Literacy

How to cite: Sumarsono, D., Arrafii, M. A., & Imansyah. (2023). Evaluating the Quality of a Teacher's Made Test against Five Principles of Language Assessment. JOLLT Journal of Languages and Language Teaching, 11(2), pp. 225-237. DOI: https://doi.org/10.33394/jollt.v%vi%i.7481

INTRODUCTION
Assessment is a crucial process involving the systematic gathering and interpretation of information about student performance through a variety of methods and techniques. Its primary purpose is to provide reliable and relevant data that can inform both teachers and students. Teachers use assessment to make informed judgments about learners' progress against specific task criteria (Chapelle et al., 2015; Baiutti, 2018) and to obtain feedback for improving their teaching methods (Follmer & Sperling, 2019). Assessment also enables teachers to determine appropriate next steps in the teaching and learning process. For students, assessment offers insights into their areas of strength and weakness and, through constructive feedback from their teachers, provides guidance for achieving their learning goals (Harding et al., 2015; Su, 2020). In summary, assessment plays a pivotal role in enhancing the quality of teaching and learning, providing critical feedback to both teachers and students, and facilitating the achievement of educational objectives.
The effectiveness of assessment in fulfilling its intended functions hinges on the quality of the instrument employed to collect information about student learning (Williams et al., 2022; Aprianoto & Haerazi, 2019). As such, the use of high-quality instruments is critical. An effective instrument is one that conforms to the standards of quality and the principles of classroom assessment: practicality, reliability, validity, authenticity, and washback (Brown & Abeywickrama, 2018). Given the unique nature of each classroom context, only classroom teachers possess the knowledge and skills necessary to design assessment tasks that align with the characteristics of their classrooms, thereby grounding a task's development in the classroom context. Consequently, teachers' assessment literacy and ability are paramount in the development of effective assessment instruments and practices (Stiggins, 1995; Popham, 2011; Arrafii, 2021). The quality of teachers' assessment relies heavily on their understanding of assessment methods and formats, their knowledge of test item construction, and their understanding of the classroom assessment principles underlying best practice (Giraldo, 2018; Fulcher, 2012). However, little is known about teachers' ability to design classroom tests. In many cases, teachers use a test from a publisher or textbook to measure their students' performance (Fulcher, 2012). It is rarely the case that they develop assessment instruments themselves for pedagogic use, and where they do, the quality of those instruments remains in question. This research aims to collect and evaluate teacher-made tests against five principles of language classroom assessment and to indicate the level of teachers' assessment literacy.

Principles of language classroom assessment
When designing a language assessment task, there are five fundamental principles that test developers, including teachers, need to consider to ensure that their products achieve their purposes: practicality, reliability, validity, authenticity, and washback (Hughes, 2003; Brown & Abeywickrama, 2018). Each principle is described below.

Practicality
When asking whether a certain assessment is feasible, we want to determine whether it is possible, or practical, to use it in our current teaching situation. This principle is concerned with the "logistical, down-to-earth, administrative issues involved in making, giving, and scoring an assessment" (Brown & Abeywickrama, 2018, p. 26). Money, time, and resources at school can significantly influence the kinds of assessment teachers are able to use. Understanding the teaching context will therefore both guide and constrain the choices teachers are able to make about assessment (Graves, 2000).

Reliability
Reliability refers to the extent to which assessment results are consistent. When using a reliable assessment, "you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next"; if an assessment is unreliable, "the score is quite likely to be considerably different, depending on the day on which it was taken" (Hughes, 2003, p. 3).
The principal idea of reliability is to ensure that students achieve their scores or results because of their abilities, not because of other factors. The reliability of assessment depends on several factors, including the students, the graders, the way the assessment is administered, and the nature of the assessment itself (Brown & Abeywickrama, 2018).

Student-Related Reliability
Personal characteristics and students' backgrounds may influence assessment results. A student's knowledge of particular subjects, cognitive style, gender, and ethnic background may all play a role in determining their results. Several temporary or random factors may also affect the reliability of assessment. Students may be ill, tired, anxious, or simply having a bad day, and this can cause results to vary each time an assessment is given (Brown & Abeywickrama, 2018). Teachers also need to be aware of students' knowledge of, and strategies for, taking tests and other kinds of assessment. Students may be very familiar with certain types of assessment or may have had a significant amount of practice or preparation before taking them. Some students may also have developed effective strategies for completing assessments, such as predicting the correct answer in multiple-choice test questions (Brown & Abeywickrama, 2018; Davies et al., 2002).

Rater Reliability
The rater is the person who marks, scores, or judges an assessment. In many cases this person will be the teacher, but in some cases the rater may be a professional language tester. As Davies et al. (2002) point out, raters are human and therefore capable of making mistakes that may influence assessment results. There are two aspects of rater reliability: inter-rater reliability and intra-rater reliability. The former refers to the similarity of scores given by two or more different raters to the same assessment. Differences in scores may be due to factors such as unfamiliarity with the criteria or scoring system, lack of attention to the criteria or scoring system, inexperience, fatigue, or bias (Brown & Abeywickrama, 2018). The latter refers to a single rater's consistency over a number of assessments (Brown & Abeywickrama, 2018). Once again, a number of factors may lead a rater either to apply a different set of criteria to each assessment or to apply the same criteria differently. These may include fatigue, the sequence in which assessments are marked, and bias towards students one may perceive as 'good' or 'bad'.
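Agreement between raters can also be quantified. The sketch below is illustrative only: the band scores are hypothetical, and Cohen's kappa is one common choice of agreement statistic rather than one prescribed by the works cited above.

```python
# A minimal sketch of checking inter-rater agreement with Cohen's kappa,
# assuming two hypothetical raters have scored the same ten essays on a
# 1-5 band scale. All values are placeholders for illustration.
from collections import Counter

rater_a = [4, 3, 5, 4, 2, 4, 3, 5, 4, 3]
rater_b = [4, 3, 4, 4, 2, 5, 3, 5, 4, 2]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw agreement

# Chance agreement: the probability that both raters assign the same band
# at random, based on each rater's marginal distribution of scores.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

Because kappa discounts the agreement the raters would reach by chance, it is a stricter indicator than raw percentage agreement; values near 1 suggest high inter-rater reliability.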
Assessment Administration Reliability
The conditions under which an assessment occurs can also affect its reliability. Factors such as noise, lighting, the legibility of test papers, and the condition of classroom furniture can lead to inconsistencies in assessment scores. Brown and Abeywickrama (2018) describe a situation in which noise from the street outside the classroom prevents students from hearing a tape recording during a listening comprehension test. The scores these students receive would more likely reflect the interference of street noise than their listening comprehension ability.

Assessment Reliability
Finally, certain characteristics of the assessment itself can contribute to unreliability. Time limits, the length of the assessment, ambiguous questions, and unclear instructions are among such factors (Brown & Abeywickrama, 2018). To take the example of time limits, students take different amounts of time to complete tasks, so if time runs out before a particular student is able to finish the assessment, his or her score will be affected. Similarly, time limits may influence the way students respond to the tasks. If, for example, a writing test requires students to write two essays in one hour, they might write the first very quickly so that they can complete the second.

Validity
Validity refers to credibility and trustworthiness. In the context of assessment, a valid assessment measures what it aims to measure; it "does what it is intended to do" (Davies et al., 2002, p. 221). Another aspect of validity concerns the interpretations or uses of assessment results. Determining whether an assessment is valid is not an easy task (Brown, 2004). However, several sources of evidence can help in making that decision: content validity, criterion validity, construct validity, consequential validity, and face validity (Brown & Abeywickrama, 2018).

Content Validity
Content validity relates to the content of an assessment. In short, the content of an assessment (its questions, tasks, and subject matter) should reflect the ability teachers are trying to assess (Brown, 2004). Hughes (2003) offers an example: it is obvious that a grammar test must be made up of items relating to knowledge or control of grammar, but this in itself does not ensure content validity. The test has content validity only if it includes a proper sample of the relevant structures, and just which structures are relevant depends, of course, on the purpose of the test. An achievement test for intermediate learners is unlikely to contain the same set of structures as one for advanced learners. If, on the other hand, a teacher were "trying to assess a person's ability to speak a second language in a conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgements does not achieve content validity" (Brown & Abeywickrama, 2018). Therefore, if the content of an assessment matches the ability it is supposed to assess, it has content validity (Brown, 2004).

Criterion Validity
Criterion validity refers to the relationship between assessment results and other indicators of language ability. If the results of the assessment that teachers use coincide with some other criterion, or benchmark, believed to provide a good indication of language ability, the task has criterion validity. There are two aspects of criterion validity: concurrent validity and predictive validity. Concurrent validity refers to how a student's performance on a particular assessment compares with his or her performance on other measures of language ability taken at roughly the same time. The achievement of similar results on different assessments can demonstrate concurrent validity. For example, a student may receive a high score on a classroom listening comprehension test and shortly afterwards receive a high score on the listening component of the IELTS exam. Predictive validity is the extent to which an assessment can predict how well an individual will be able to perform a particular task in the future.
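Concurrent validity of this kind is often put into numbers by correlating scores on the two measures. The sketch below is illustrative only: the paired scores are hypothetical, and Pearson's r is just one possible coefficient, not one mandated by the sources cited here.

```python
# A minimal sketch of quantifying concurrent validity, assuming hypothetical
# paired scores: a classroom listening test (0-100) and the corresponding
# IELTS listening band for the same ten students.
from math import sqrt

classroom = [72, 85, 60, 90, 78, 55, 88, 66, 74, 81]
ielts_band = [6.0, 7.5, 5.5, 8.0, 6.5, 5.0, 7.5, 6.0, 6.5, 7.0]

n = len(classroom)
mean_x = sum(classroom) / n
mean_y = sum(ielts_band) / n

# Pearson's r: sum of cross-deviations over the product of the root sums of
# squared deviations (the 1/n factors cancel). r near +1 means the classroom
# test ranks students much as the external benchmark does.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(classroom, ielts_band))
sxx = sum((x - mean_x) ** 2 for x in classroom)
syy = sum((y - mean_y) ** 2 for y in ielts_band)
r = sxy / sqrt(sxx * syy)
print(f"Pearson r = {r:.2f}")
```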
Construct Validity
Language ability or proficiency is not a directly accessible, tangible trait, and language assessment is therefore based on one's view of the nature of language ability. A construct is a concept or definition of language ability; construct validity concerns how well an assessment represents the concept or definition of language ability upon which it is based (Bachman, 1990). The test developer must spell out just what that construct is or what it consists of. The test can be valid only if the test construct is a complete and accurate picture of the skill or ability it is supposed to measure. For a test to have construct validity, the tasks a student is required to perform must be consistent with our definition of language ability. In other words, an assessment must "tap into" our concept of language ability (Brown, 2004, p. 25).

Consequential Validity
It is important to bear in mind that language assessment does not occur in isolation; it is used within a broader social context and is used on people. This means teachers have to consider the consequences of language assessment (consequential validity). Issues to be considered here include whether the assessment evidence works well enough to support appropriate decisions about students' learning, which types of language ability are valued or perceived as important in the assessment, the extent to which assessment results can be used to judge the potential performance of students in real life, and the potential impact of the assessment on classroom instruction. All of these factors are likely to influence our decision about whether or not to use a certain type of assessment (Bachman, 1990).

Face Validity
Face validity is a critical concept in assessment, pertaining to the extent to which an assessment appears, from its surface characteristics, to measure what it claims to measure. For instance, a reading comprehension assessment that entails reading a short newspaper article and answering questions about it is likely to appear to measure reading comprehension (Bachman, 1990). Face validity is determined by the perceptions of teachers, assessors, and students; if an assessment item appears to be well designed, it is deemed to possess face validity (Heaton, 2000). However, face validity alone is inadequate to establish the validity of an assessment. It is nonetheless crucial because the perceptions of stakeholders will affect their reactions to the assessment (Bachman, 1990). An assessment without face validity may create numerous challenges: as Hughes (2003) explains, such a test may not be accepted by candidates, educators, education authorities, or employers, and if it is used, the candidates' responses to it may not accurately reflect their abilities.

Authenticity
An important factor that requires careful attention is the authenticity of the language and tasks used in language assessments. Authenticity is the degree to which the language and tasks used in an assessment are representative of real-life situations. In other words, educators must consider whether the assessment accurately captures actual language use in natural settings.
Authentic language has been defined as "oral or written language examples that are not consciously produced for instructional reasons" (Nunan, 1999). Likewise, authentic tasks are those that ask students to act in ways that closely mimic how they behave in real-life situations. For a precise evaluation of language competency, authentic language and tasks must be included in language assessments.

Washback
When choosing or designing an assessment, teachers need to ask whether it will have a positive influence on their teaching and on students' learning experience. Assessment can exert a powerful influence on teachers, learners, and society in general, so it is not surprising that assessment can affect the nature of English language teaching and learning, including which aspects of the language are taught, the amount of time spent on particular aspects of the language, and the teaching methodology used in classrooms. The effect an assessment has on teaching and learning is known as washback or backwash, and it can be either positive or negative (McNamara, 2000).

These five principles, when considered during assessment design and development, help ensure effective classroom assessment. As members of the classroom community who interact and engage intensively with classroom discourse, teachers are the most knowledgeable people about the classroom context and are thus likely to be the best assessors of student learning. Owing to their prolonged engagement in classroom activities, teachers can gather more accurate and comprehensive evidence of students' learning and development than outsiders can. To date, however, we have a limited understanding of teachers' performance in designing and developing the classroom assessment instruments that help them make sound decisions about students' learning. What we do know is that, rather than developing their own instruments, teachers frequently adopt available instruments from textbooks to measure their students' learning progress and achievement (Fulcher, 2012). Many such adoptions proceed without proper adaptation and modification. Additionally, although some teachers report having developed their own assessment instruments to capture student learning, quality concerns about these instruments persist (Fulcher, 2012; Popham, 2011). We remain uninformed about the extent to which such self-made assessment instruments address the quality issues and principles of an effective instrument. This research was therefore framed and guided by the following research question: To what extent have the principles of language assessment been incorporated into a teacher-made test?

RESEARCH METHOD
Given its purpose of uncovering the quality of a teacher-made assessment against five principles of language classroom assessment, this research employed a case study design (Yin, 2009) in which one assessment task from a teacher working in an English department at the higher education level was selected for thorough evaluation. The unit of analysis in this study was thus the teacher-made test rather than the teacher himself. The test was examined against the principles of language classroom assessment (Brown & Abeywickrama, 2018).
Data Gathering Method
The study gathered assessment instruments that teachers had used to gather evidence of student learning. In this study, an assessment instrument refers to a teacher-made language test used in formal summative assessment, such as mid-semester or final-semester assessment. This criterion excludes daily assessment tasks and exercises used to monitor student learning. A set of English tests was gathered from participating teachers in the English education department, Universitas Pendidikan Mandalika, Lombok. The researchers contacted the teachers in person and asked for a copy of their assessment tasks. To ensure the trustworthiness and accuracy of the research findings, the researchers verified that the collected assessment artifacts were teacher-made tests; any artifact shown to have been taken from a publisher was excluded from analysis. Given these inclusion criteria, a number of teacher-made tests were brought forward for analysis. In this report, however, we present the results of our evaluation of a single assessment task, an essay writing task for sophomore students in the English department (see appendix 1 for the details of the task). This task was selected because its features characterise an effective assessment instrument, and it may be considered a model of a good assessment task for essay writing.

Data Analysis Method
To evaluate the quality of the teacher's assessment instrument, the data were analysed using both qualitative and quantitative methods. The qualitative method describes the data in words, while scores (quantitative data) were also used to develop a robust analysis supporting the qualitative description. Using both methods together strengthens the description of the research data. To do this, comparative and contrastive analysis methods were used (Mahsun, 2017). Initially, the assessment task was read, annotated, and evaluated against the principles of assessment design and development (Brown & Abeywickrama, 2018). This process required a description sheet to capture evidence of congruence between the teacher's assessment instrument and the principles. To indicate the quality of the instrument, it was rated on a scale from 1 (poor) to 5 (excellent) (see appendix 2 for the details of this rating system). The ratings were then tabulated and computed to measure the overall quality of the assessment instrument, and from this evidence a category for the quality of the instrument was developed. To ensure the trustworthiness of the analysis, intra-rater and inter-rater strategies were employed. Initially, the principal investigator analysed the data and then revisited the analysis a few weeks later. The same data were then evaluated by the co-researcher, after which we sat together to discuss the data and arrive at a final evaluation. The results of our final, agreed analysis (scoring) can be seen in appendix 2.
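As a concrete illustration of the tabulation step, the sketch below averages per-aspect ratings on the 1-5 scale and maps each average back to the category labels used in appendix 2. The rating values are placeholders for illustration, not the study's data.

```python
# A minimal sketch of tabulating and averaging 1-5 ratings per principle,
# assuming hypothetical rating values (placeholders, not the study's data).
from statistics import mean

ratings = {
    "practicality": [3, 5, 5],        # e.g., time, money, resources/equipment
    "reliability": [4, 5, 5, 4],      # student, rater, administration, assessment
    "validity": [4, 4, 4, 4, 4],      # content, criterion, construct, consequential, face
    "authenticity": [5, 5, 5, 5, 5],
    "washback": [4, 5, 4, 5, 5, 4, 5],
}

# Category labels follow the note in appendix 2.
labels = {5: "Excellent", 4: "Very Good", 3: "Good", 2: "Satisfactory", 1: "Poor"}

for principle, scores in ratings.items():
    avg = mean(scores)
    print(f"{principle:13s} {avg:.1f}/5.0 ({labels[round(avg)]})")

overall = mean(s for scores in ratings.values() for s in scores)
print(f"{'overall':13s} {overall:.1f}/5.0 ({labels[round(overall)]})")
```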
RESEARCH FINDINGS AND DISCUSSION
To answer the research question regarding the extent to which language assessment principles were incorporated into the teacher's classroom assessment instrument, the assessment artifact was collected and analysed according to the five language assessment principles proposed by Brown and Abeywickrama (2018): practicality, reliability, validity, authenticity, and washback. The teacher-made test in this study is described below in relation to each principle, and the results of the analysis are displayed in appendix 2.

Practicality
With respect to educational resources within and outside the classroom, the practicality of this assessment can be considered high (rated 5 out of 5 for money and for resources/equipment), because teachers do not need to spend a large amount of money to prepare and administer the test in the classroom, nor do they need equipment for designing, collecting, administering, and evaluating students' work on this test. The test requires only several pieces of paper, and the teacher could administer the task either orally, by dictating the instructions and question, or in writing, by providing the writing instructions and questions on the board. For this reason, the test can be used in different educational contexts, such as remote or urban schools, schools with high or low socioeconomic status, and small or large classes. In terms of the time spent designing the test, it requires a relatively short time. However, designing an assessment rubric that mirrors students' ability to write an expository essay might be challenging and time-consuming for some teachers. In addition, the time spent marking students' work on this test might be another issue associated with the practicality principle. Teachers can use a holistic scoring rubric, which is believed to be effective for scoring students' writing (Brown & Abeywickrama, 2018), but they need more time to grade the task, especially in a large class. In this regard, the test is less practical than other types of assessment, such as multiple-choice tests, so this aspect of practicality is rated 3 out of 5.

Reliability
The overall score for the reliability principle is 3.6/5.0. In terms of student-related reliability, the test can be considered highly reliable (scored 4/5) because it contains a topic closely related to students' lives and past experience. It is assumed that students have background knowledge of the topic presented in the test items: the test asks students to recall a childhood experience related to the favourite game they played as a kid. This kind of lived experience is unique to each student, and it will encourage and motivate them to write because they are knowledgeable about the topic. With regard to inter-rater reliability, this aspect of the assessment was scored 5 out of 5. Rater reliability depends on the raters' quality, experience, perspectives, and perhaps qualifications, all of which can influence how students' work is graded (Weigle, Boldt, & Valsecchi, 2003). Inter-rater reliability could be low if the raters held different perspectives on assessing students' work and used rating scales with different contents; conversely, it can be high if two or more raters have agreed on the assessment criteria to be used (Hughes, 2003). Regarding assessment administration reliability, reliability relies heavily on factors such as noise, class size, and the number of students taking the test. When this test is delivered in an undisrupted atmosphere, its administration reliability can be high; we therefore scored this aspect 5/5.
The assessment reliability of this test can be high because it has clear, understandable instructions followed by several scaffolding questions that limit students' freedom in responding (Hughes, 2003). However, a further instruction about this should be displayed so that students know more precisely what to include in their essay. For this reason, assessment reliability was scored 4/5.

Validity
We rated the validity of this assessment 4 out of 5 overall, with each validity aspect scored 4. With regard to content validity, the validity of this test can be considered high because the questions, tasks, and subject matter reflect the extensive writing skills to be tested, indicated by the number of words required (300). In addition, there is a match between the targeted writing genre and the instructions and questions presented in the assessment: the task asks students to write a long stretch of expository essay, complemented by a sequence of instructions that guide them in composing an extensive piece of writing. With regard to criterion validity, however, the validity of this assessment might be considered low owing to the absence of other measures of language ability. Nevertheless, predictive validity might be achieved, as the result of this assessment can serve as a reference for predicting students' ability to write an extensive text on a future occasion. In terms of construct validity, the test is valid because the task is consistent with the definition of the targeted language ability in several ways. First, extensive writing is a long stretch of well-connected writing, and this assessment asks students to write a 300-word expository essay, which meets the criteria of extensive writing. Second, an expository essay asks students to clarify information and provide reasons to the readers, and this test asks test-takers to describe a particular childhood game and explain why it became the favourite game they played as children. Third, the task provides students with an assessment rubric, which lets them know the structure of the essay being assessed; if students know these assessment components, they can focus on them when developing their essays, and this improves the construct validity of the assessment. The test analysed here may also be considered to meet consequential validity because it can be used to make judgments and decisions about students' performance. With comments and grades from markers or raters, it informs students how well they are learning. Moreover, the assessment results can be used by teachers as a reference for improving their instruction, serving a formative function of assessment (Black & Wiliam, 2018). As for face validity, the test meets this principle: its instructions and questions require students to write, not to listen, read, or speak, and there is a clear writing genre for students to compose. Hughes (2003) argues that a good test is one that tests only the targeted skills, in this case writing ability. Similarly, Heaton (2000) asserts that a test can have face validity when teachers, students, and assessors are comfortable with the assessment items. As evaluators of this task, we agreed that it preserves face validity.
Authenticity
The authenticity principle of language assessment can be evaluated by the extent to which the assessment language is natural and clearly reflects authentic, real-life situations. All aspects of the authenticity principle were scored 5, and we rated the test 5 overall, for several reasons. First, it is evident from the test that the items are presented in a natural and communicative way, which leads to the test instructions being well understood by test-takers and evaluators. Second, the test is contextualised, as a context is presented in the topic or material of the test (Firman et al., 2021): the instructions introduce the context of the assessment, in which test-takers must relate the topic to their past (childhood) experience. Third, the topic is meaningful, interesting, and relevant to students because it is well known to them and closely related to their lives. O'Malley and Pierce (1996) found that providing students with an interesting, relevant, and familiar topic can help them improve their writing performance. In addition, writing on a topic that is familiar from personal circumstances can increase students' interest (Pajares & Schunk, 2005).

As there is only one task for students to perform, one might argue that the task is not presented thematically. However, this is not a significant issue for the authenticity principle, because there is a clear sequence of instructions that helps students complete the task. Moreover, the topic and context are introduced early in the instructions, allowing students to activate their prior knowledge of the topic. This is followed by an instruction that leads to the focus of the task; students are then given follow-up instructions that provide an approach to finishing the task and, finally, a reasonable amount of time to respond.

Washback
A test can produce meaningful washback if it positively informs students and teachers about how well they are doing their jobs, shapes what and how teachers teach and students learn, provides adequate time for preparation, gives learners constructive feedback that boosts language development, is oriented more toward formative than summative purposes, and provides opportunities for learners to achieve peak performance (Brown & Abeywickrama, 2018). Considering these features of positive washback, we gave this task 4.5 for the washback principle. First, the task analysed in this study assesses students' capacity to compose an expository essay: it tests their ability to describe, explain, and provide information to readers. This target is reflected in the task's instructions and questions, which ask students to describe one of the favourite games they played as children and to provide reasons for their selection. Hence, the instructions and questions meet the particular characteristics of an expository essay and do not ask students to write in other genres. Second, this assessment creates opportunities for students to perform self- or peer assessment of their work (Dolba et al., 2022). This practice is recommended because it helps students discover their learning weaknesses and strengths, which subsequently improves performance (Baars et al., 2014; Haerazi & Kazemian, 2021).
Research has indicated that training in the use of peer or self-assessment can be effective in improving students' writing performance on future tasks (Kostons et al., 2012). Further, this assessment provides written formative feedback that students can use to enhance their writing development. Third, because the test provides guided questions for completing the task, teachers and students understand the instructions and the task in the same way. In formative assessment practices, a shared understanding between teachers and students of the instructions and the learning target can increase learning opportunities and outcomes (Arrafii, 2021). Lastly, the test provides opportunities for students to practise and apply skills and knowledge in writing an expository essay, a competence that is undoubtedly crucial and useful for their future lives, especially for those who want to pursue a career as a journalist or writer. However, some characteristics of positive washback proposed by Hughes (2003) are not fully accommodated in this test. For example, the test provides only a single task, writing an expository essay, rather than a range of tasks. With a single task, students' preparation for the test is likely to be narrowed to this one type of ability or performance.

CONCLUSION
The teacher-made assessment reported in this paper can be considered a very good example of a writing assessment task because it satisfies all five principles of classroom assessment, although some minor issues remain. However, the finding reported here cannot be generalised to other teachers, as they may have developed different levels of expertise in assessment design and development. Given Popham's (2011) observation that only a small number of teachers have adequate assessment literacy, and given that the respondent in this research appears to belong to this minority, an interventionist research approach through professional development involving a large number of teachers is considered an effective way to improve teachers' assessment literacy, especially in the area of test item design and development.

REFERENCES
Aprianoto, & Haerazi. (2019). Development and assessment of an interculture-based instrument model in the teaching of speaking skills. Universal Journal of Educational Research, 7(12), 2796-2805. https://doi.org/10.13189/ujer.2019.071230
Arrafii, M. A. (2021). "We must assess all, even a student farting [is also assessed] for the behavioural aspect of learning": Teachers' conceptions of assessment in the context of assessment reform in Indonesia. The Curriculum Journal, 00, 1-23. https://doi.org/10.1002/curj.130
Arrafii, M. A. (2021). Assessment reform in Indonesia: Contextual barriers and opportunities for implementation. Asia Pacific Journal of Education. https://doi.org/10.1080/02188791.2021.1898931
Baars, M., Vink, S., van Gog, T., de Bruin, A., & Paas, F. (2014). Effects of training self-assessment and using assessment standards on retrospective and prospective monitoring of problem solving. Learning and Instruction, 33, 92-107.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baiutti, M. (2018). Fostering assessment of student mobility in secondary schools: Indicators of intercultural competence.
Intercultural Education, 29(5-6), 549-570. https://doi.org/10.1080/14675986.2018.1495318
Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice, 25(6), 551-575.
Brown, H. D., & Abeywickrama, P. (2018). Language assessment: Principles and classroom practices (3rd ed.). New York: Pearson Education.
Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York: Pearson Education.
Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385-405. https://doi.org/10.1177/0265532214565386
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (2002). Dictionary of language testing. Beijing: Foreign Language Teaching and Research Press.
Denscombe, M. (2007). The good research guide: For small-scale social research projects (3rd ed.). Buckingham, UK: Open University Press.
Dolba, S., Gula, L., & Nunez, J. (2022). Reading teachers: Reading strategies employed in teaching reading in grade school. Journal of Language and Literature Studies, 2(2), 62-74. https://doi.org/10.36312/jolls.v2i2.874
Firman, E., Haerazi, H., & Dehghani, S. (2021). Students' abilities and difficulties in comprehending English reading texts at secondary schools: An effect of phonemic awareness. Journal of Language and Literature Studies, 1(2), 57-65. https://doi.org/10.36312/jolls.v1i2.613
Follmer, D. J., & Sperling, R. A. (2019). Examining the role of self-regulated learning microanalysis in the assessment of learners' regulation. The Journal of Experimental Education, 87(2), 269-287. https://doi.org/10.1080/00220973.2017.1409184
Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9(2), 113-132.
Giraldo, F. (2018). Language assessment literacy: Implications for language teachers. Profile: Issues in Teachers' Professional Development, 20(1), 179-195.
Graves, K. (2000). Designing language courses: A guide for teachers. Boston: Heinle and Heinle.
Haerazi, H., & Kazemian, M. (2021). Self-regulated writing strategy as a moderator of metacognitive control in improving prospective teachers' writing skills. Journal of Language and Literature Studies, 1(1), 1-14. https://doi.org/10.36312/jolls.v1i1.498
Harding, L., Alderson, J. C., & Brunfaut, T. (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing, 32(3), 317-336. https://doi.org/10.1177/0265532214564505
Heaton, J. B. (2000). Writing English language tests (new ed.). Beijing: Foreign Language Teaching and Research Press.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.
Kostons, D., van Gog, T., & Paas, F. (2012). Training self-assessment and task-selection skills: A cognitive approach to improving self-regulated learning. Learning and Instruction, 22, 121-132.
Mahsun. (2017). Metode penelitian bahasa [Methods of language research]. Depok: PT Rajawali Pers.
McNamara, T. (2000). Language testing.
Oxford: Oxford University Press.
Nunan, D. (1999). Second language teaching and learning. Boston: Heinle and Heinle.
Popham, W. J. (2011). Assessment literacy overlooked: A teacher educator's confession. The Teacher Educator, 46(4), 265-273.
Stiggins, R. J. (1995). Assessment literacy for the 21st century. The Phi Delta Kappan, 77(3), 238-245.
Su, H. (2020). Educational assessment of the post-pandemic age: Chinese experiences and trends based on large-scale online learning. Educational Measurement: Issues and Practice, 39(3), 37-40. https://doi.org/10.1111/emip.12369
Williams, T., Wiener, J., Lennox, C., & Kokai, M. (2022). Lessons learned: Achieving consensus about learning disability assessment and diagnosis. Canadian Journal of School Psychology, 37(3), 215-236. https://doi.org/10.1177/08295735221089457
Yin, R. K. (2009). Case study research: Design and methods (4th ed.). London: SAGE Publications.

Appendix 1: A sample of the teacher-made test designed to measure students' performance in writing an expository essay

Appendix 2: The result of the researchers' evaluation of the teacher-made test (appendix 1) against each principle of language assessment

No  Criteria (assessment principles)                          Rating (1-5)
1   Practicality
    a) time for design, administration, marking               3
    b) money                                                  5
    c) resources/equipment                                    5
2   Reliability
    a) student-related reliability                            4
    b) rater reliability (intra- and inter-rater)             5
    c) assessment administration reliability                  5
    d) assessment reliability                                 4
3   Validity
    a) content validity                                       4
    b) criterion validity                                     4
    c) construct validity                                     4
    d) consequential validity                                 4
    e) face validity                                          4
4   Authenticity                                              5
    a) language is as natural as possible                     5
    b) questions/tasks contextualised, not isolated           5
    c) topics meaningful, interesting, relevant to students   5
    d) questions/tasks organised thematically                 5
    e) questions/tasks closely reflect real life              5
5   Positive Washback (4.5 overall)
    a) assess abilities we want students to develop
    b) include wide range of questions/tasks
    c) vary questions/tasks over time
    d) direct assessment
    e) criterion-referenced assessment
    f) assessment based on objectives
    g) assessment well understood by students and teachers
Total
Note: Excellent (5); Very Good (4); Good (3); Satisfactory (2); Poor (1)