Developing an Indonesian Reading Proficiency Test for BIPA Learners

ANDIKA EKO PRASETIYO 1

Abstract

The use of Indonesian proficiency tests for non-native speakers of Bahasa Indonesia is still equated with tests for native speakers. This has become a point of debate for many teachers and experts of Indonesian for Foreigners (Bahasa Indonesia untuk Penutur Asing - BIPA). The crux of the debate is whether the same proficiency test should be used for native speakers (NS) and non-native speakers (NNS) alike, or whether separate tests should be developed. Given the peculiarities of Bahasa Indonesia, Indonesian proficiency tests for NS and NNS should be differentiated. The underdevelopment of specialized proficiency tests for NNS can be explained by the fact that Bahasa Indonesia is not one of the dominant languages learned in the world today. This research aims to develop materials for an Indonesian proficiency test for NNS, focusing on reading comprehension. To strengthen the development of the test, discussion of the processes for defining the theoretical construct was combined with empirical analysis of students' results. The method used in this study involved expert review, text readability analysis, and item analysis. The findings show that the test items developed can be used to test students' proficiency, particularly in reading comprehension.

Keywords
BIPA, foreign language, Indonesian, language testing, reading

1 A fulltime graduate student at the University of Melbourne, Melbourne, Australia; andikaekop@gmail.com

Introduction

The proficiency test developed for Indonesian language learners, Uji Kemahiran Berbahasa Indonesia (UKBI), is used as an Indonesian language proficiency test for both foreign speakers and native speakers. However, the assessment instruments used to test native speakers (NS) and non-native speakers (NNS) should be differentiated, since the test objectives and test takers are distinct. Based on this issue, we sought to conduct research into the development of Indonesian language tests that measure the Indonesian reading ability of NNS. The product resulting from this study is therefore a proficiency test developed for Indonesian language learners. Furthermore, the test developed can serve as a recommendation and an alternative for the language center as a measurement tool in addition to UKBI. The test focuses on the reading comprehension aspect of testing. The test material refers to the CEFR curriculum, in which, at the advanced level, speakers must be able to read and comprehend all forms of written language, including structurally and linguistically complex texts such as abstracts, manuals, scientific articles, and literary works. A pilot study was also included to ensure that the developed test has reliability and readability. The test was then administered to students at the University of Melbourne in Semester 2, 2018, in the subject Indonesian 3. To address the objectives of this study, three main questions are explored:
(1) Based on the content validity, does the test reflect the course objectives? (2) What are the level of difficulty, the discrimination index, and the distractor performance of each item? (3) What revisions are to be made to test items based on the test analysis?

Literature Review

Reading comprehension

The skill of reading comprehension is one of the most critical aspects of learning a language. For this reason, reading tests are now a crucial part of most major foreign language assessment protocols, including TOEFL, IELTS, and TOEIC. In the last decade, many studies have investigated reading comprehension tests for foreign languages (e.g., Bernhardt, 1983; Gorsuch & Taguchi, 2008; Gorsuch & Taguchi, 2010; Keenan, Betjemann, & Olson, 2008; Rahmiati & Emaliana, 2017; Taguchi, Gorsuch, Takayasu-Maass, & Snipp, 2012; Taguchi, Takayasu-Maass, & Gorsuch, 2004). Tests of reading comprehension have become some of the most important instruments with which to measure a learner's proficiency in foreign language acquisition. This is because reading tests are cognitively demanding, requiring the synchronisation of memory, attention, and comprehension (Sellers, 2000). In addition, reading comprehension tests can involve both lower and higher orders of thinking (Rahmiati & Emaliana, 2017). This can be seen from the variety of texts presented in reading comprehension tests, including expositions, news, and literature.

Reading comprehension tests also require several key characteristics in order to be considered sound and reliable. Firstly, the test must have validity and a relevant construct (Hughes, 2003). Secondly, the items included in a reading test should be reliable and consistent in terms of producing results (Brown, 2004). Thirdly, the reading test should be able to distinguish the level attained by the learner, such as whether the learner has achieved a primary, intermediate, or advanced level of language proficiency (Heaton, 1988). Finally, in terms of practicality, reading tests should be effective and efficient to administer (Weir, 1990).

Question types in reading tests

There has also been some discussion about the types of questions that should be included in such reading tests. Pyrczak (1975) found that there was no significant difference in results between students who read the passage before answering and students who did not read the passage when completing a multiple-choice reading test. In addition, Jones (1977) argues that a proper foreign language reading test should utilise model translation, although he noted that this would be difficult to assess since it might focus on grammatical aspects rather than meaning. Meanwhile, Cranney (1972) suggests that the cloze method is an excellent way to test reading skills, being easy both to produce and to score. Shohamy (1981), however, found that students have a negative perspective towards cloze reading: they often felt that cloze tests were tough and frustrating. On the other hand, there is research which supports the use of the multiple-choice method in reading tests. Gorjian (2013) argues that tests with large numbers of participants are better suited to the multiple-choice question type. On this basis, we have chosen a multiple-choice format in developing a reading comprehension test for Bahasa Indonesia as a foreign language.
Empirical studies on foreign language tests in reading

Regarding published research on the development of foreign language tests, several studies have investigated testing for reading comprehension (e.g., Nindyaningrum, 2018; Rahmiati & Emaliana, 2017; Saifudin, Suwandi, & Setiawan, 2014). However, studies specifically exploring reading comprehension of Bahasa Indonesia as a foreign language are limited. Saifudin et al. (2014) developed an instrument that can be used to measure the proficiency of NNS in Bahasa Indonesia. In developing this instrument, they adopted the international standardized test model, IELTS. However, they addressed all language skills, not simply reading ability and comprehension. Rahmiati and Emaliana (2017) also developed a reading comprehension test, but only for English as a foreign language. The reading test in their study targets both higher and lower order thinking, and the questions they developed were in multiple-choice format. Nindyaningrum (2018) conducted a study on the development of reading comprehension test instruments for NNS; the instrument she developed can be used to measure the reading proficiency of Indonesian learners. The present research mirrors the study by Nindyaningrum (2018). It should be noted, however, that Nindyaningrum (2018) did not perform a test item analysis including, for example, descriptive statistics, facility value, discrimination index, or distractor analysis. To address this gap, this study also develops a reading test analysis that includes such measures.

Descriptive statistics, item analysis, item facility, and item discrimination

Descriptive statistics are beneficial in developing a reading proficiency test because they allow examination of the students' score distribution. The aim of the proficiency test is to distinguish the levels of learners' competencies in comprehending reading, so the score distribution may indicate whether students' competencies are low, medium, or advanced. The score distribution can also indicate the level of difficulty of the questions (Brown & Hudson, 2002). To examine the score distribution in a reading test, Brown and Hudson (2002) suggest using measures of central tendency, i.e., the mean, mode, and median, which are part of descriptive statistics.

Item analysis includes item facility and item discrimination. These two types of analysis are used to determine which items can be retained and which need to be changed. The level of difficulty, i.e., whether a test item is easy or difficult, can be identified by calculating the item facility (IF) value, also known as item difficulty. There are two methods for calculating item facility in proficiency reading tests. The first is to count the number of test takers who answered the item correctly and divide by the total number of test takers (Bachman, 2004; Farhady, 2012). An alternative method is proposed by Bachman (2004, p. 122), who suggests calculating "the proportion of test takers who chose the different distractors" in order to measure the difficulty level of items.
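To make the first of these two calculations concrete, here is a minimal Python sketch (the study itself used MS Excel for this analysis); the response matrix, function name, and sample values are invented for illustration only.

```python
# Illustrative sketch: item facility (IF) as the proportion of correct answers.
# Rows are test takers, columns are items; 1 = correct, 0 = incorrect.
# The sample data are hypothetical, not the study's actual responses.
responses = [
    [1, 1, 0, 1],  # test taker 1
    [1, 0, 0, 1],  # test taker 2
    [1, 1, 1, 0],  # test taker 3
    [0, 1, 0, 1],  # test taker 4
]

def item_facility(responses):
    """Return IF per item: correct answers divided by number of test takers."""
    n_takers = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_takers
        for i in range(n_items)
    ]

print(item_facility(responses))  # -> [0.75, 0.75, 0.25, 0.75]
```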
Item discrimination in proficiency tests refers to the ability of an item to distinguish the proficiency levels of test takers, such as basic, intermediate, and advanced learners. To determine the item discrimination value, the number of test takers who answer each test item correctly is calculated, and these numbers are used in a formula for the discrimination index (Bachman, 2004). Item discrimination values range between -1 and +1, and higher values are better: a high item discrimination indicates that the item is very effective at identifying the proficiency level of test takers (Farhady, 2012).
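The discrimination values reported later in the Findings (Table 3) appear to be point-biserial correlations computed in SPSS. As an illustration only, the sketch below recomputes that statistic in Python (numpy is assumed; the data and names are invented): the point-biserial is simply the Pearson correlation between a dichotomous item score and the total score.

```python
import numpy as np

# Illustrative sketch of an item discrimination statistic: the point-biserial
# correlation between each item (0/1) and the test taker's total score.
# Totals here include the item itself (the uncorrected item-total correlation).
# The response matrix is hypothetical, not the study's data.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
])

totals = responses.sum(axis=1)

def point_biserial(item_scores, totals):
    """Pearson correlation of a dichotomous item with the total score."""
    return np.corrcoef(item_scores, totals)[0, 1]

for i in range(responses.shape[1]):
    d = point_biserial(responses[:, i], totals)
    print(f"item {i + 1}: discrimination = {d:.2f}")

# Note: an item that every test taker answers correctly has zero variance,
# so this statistic is undefined for it -- consistent with the study's
# report that one such item "did not appear" in the SPSS output.
```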
Methodology

To strengthen the development of the test, the processes of theoretical construct definition are discussed along with empirical analysis of students' test results. The method used in this study involved several steps: expert judgment, analysis of text readability, and analysis of test items. An outline of test specifications was designed before creating the test items. The steps in developing the test were: 1) developing an outline of test specifications; 2) writing the blueprint of the test; 3) writing the test items; 4) validating the test with an expert; 5) administering the test; 6) analyzing the test results; and 7) revising the test.

The test was developed to measure learners' comprehension in reading different text genres. Each item of the test relates to readings in Bahasa Indonesia of various types, such as exposition texts, news, and literature in the form of short stories. Each text has a length of about 136-295 words and was adapted from various sources. Topics and features of Bahasa Indonesia were carefully worked into the texts, questions, and multiple-choice alternatives. The micro skills tested include understanding topics, main ideas, supporting details, implied details, and word meaning, as well as drawing conclusions from texts. Moreover, the expert consulted, a University of Melbourne lecturer, stated that the test developed is feasible and ready to be used for testing. Based on this evidence, we conclude that the content and items of the reading test are valid.

Participants

This study was conducted at the University of Melbourne and involved 32 students between the ages of 18 and 27. Each participant was drawn from one of two different classes of the same subject, Indonesian 3, a Bahasa Indonesia class considered to be at intermediate level. The students consisted of 16 males and 16 females. All were NNS of Bahasa Indonesia originating from 7 different countries, namely Australia (N = 24), Malaysia (N = 1), Brunei Darussalam (N = 1), England (N = 1), USA (N = 1), Singapore (N = 1), and Indonesia (N = 1). It should be noted that the one student from Indonesia has lived in Australia for a long time and uses English as their everyday language. When asked to self-rate their level of proficiency, 3 students rated themselves as advanced learners, 19 as intermediate, and 10 as below intermediate. Regarding the duration of learning Bahasa Indonesia, 14 students had been studying the language for less than 1 year, 7 for about 2-5 years, and 11 for 6 years or more. In terms of the intensity of reading in Bahasa Indonesia, for example through magazines, books, and newspapers, 31% of students stated that they never do such reading, 44% said rarely, and 25% said they do some reading but not extensively.

Procedures: test item writing and piloting

The test development procedure comprised two main stages: test development (test item writing and piloting) and test administration. I developed test items based on an example of a Bahasa Indonesia proficiency test instrument, but they were designed to meet the purpose of this test, which is to measure the reading proficiency of NNS of Bahasa Indonesia. The questions are based on three different types of authentic text: exposition, news, and literature.

Initially, we developed a test with a variety of topics and text types, including a personal letter, news, and literature. The first text was a personal letter (constructed by the researcher). The second text was a news report about a museum fire that occurred in Jakarta, written by Nurito (2018). The last text was a literary text, a short story entitled "Anak Kebanggaan" by Navis (2018). The story was edited for appropriate length and readability; no other significant changes were made to any of the three texts. There are 20 questions in total, with 5, 7, and 8 questions for texts 1, 2, and 3, respectively. Each item has one correct answer and 3 distractors. All items were aimed at measuring learners' reading comprehension in Bahasa Indonesia. We thus designed 20 multiple-choice questions based on three short texts: a letter, a news text, and a short story. The time allotted for the test was 20 minutes. A sketch of this blueprint as a data structure is given below.
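For concreteness, the blueprint just described can be recorded as a small data structure. This sketch is ours, not part of the study's procedure; the field names are invented, and the code simply checks that the per-text item counts sum to 20.

```python
# Illustrative sketch: the test blueprint as a data structure.
# This representation and its field names are ours; the study documents
# the blueprint in prose, not code.
from dataclasses import dataclass

@dataclass
class TextSpec:
    genre: str        # text type used in the test
    n_items: int      # number of multiple-choice items on this text

BLUEPRINT = [
    TextSpec(genre="exposition (revised from a personal letter)", n_items=5),
    TextSpec(genre="news (museum fire in Jakarta)", n_items=7),
    TextSpec(genre="short story ('Anak Kebanggaan')", n_items=8),
]

OPTIONS_PER_ITEM = 4   # one key plus three distractors
TIME_LIMIT_MIN = 20    # planned time allotment

total_items = sum(spec.n_items for spec in BLUEPRINT)
assert total_items == 20, "blueprint should yield 20 items"
print(f"{total_items} items, {OPTIONS_PER_ITEM} options each, {TIME_LIMIT_MIN} min")
```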
To gauge test readability, we conducted a pilot with 3 NNS students to ensure that the test developed was feasible. In addition, we consulted the lecturer of Indonesian 3, both via email and in direct discussion, with a view to gaining further input and feedback on the test. Based on the pilot, the lecturer responded positively to the test. However, some of the questions on text 1 (the letter) were too easy, and most students answered them correctly. The lecturer also noted in our discussion that comprehension of the letter text did not match the construct of relevance to real life. We therefore revised the first text by transforming it into an exposition text. We also adjusted the layout by providing row numbers on the left side of the text.

Administration of the test

The test was conducted twice, in two different sessions of the same subject, Indonesian 3. The tests were administered on 21 and 22 May 2018, with a duration of 45 minutes in each class. Before the test, the teacher assisted us by explaining the purpose of the research to the students, and by explaining that the tests might help them prepare for final exams or improve their proficiency in Bahasa Indonesia, especially in reading. Participants were given 25 minutes to complete the 20 reading questions. Before working on the questions, students filled out a background questionnaire covering name, personal information, country of origin, previous experience learning Bahasa Indonesia, and a self-assessment. These data were collected in addition to test scores to help identify other variables that may contribute to the variability of the test results.

Findings

Descriptive statistics of the test results

Table 1 shows that the reading comprehension scores in Indonesian as a foreign language in this study had a mean of 11.7 out of 20 (SD = 3.6). This means that, on average, 58% of the test items were answered correctly by students. The results also show that the lowest score is 20% (N = 1), while the highest score is 100% (N = 1). To determine the learners' levels, we adopted a TOEFL-style level rubric: elementary (0%-50%), low intermediate (51%-75%), high intermediate (76%-85%), and advanced (86%-100%). On this scale, 11 students were at the elementary level, 17 at low intermediate, 1 at high intermediate, and 3 at advanced level. These results indicate that the test developed was appropriate, being neither too easy nor too difficult for NNS of Bahasa Indonesia.

Table 1. Descriptive statistics of the test

ID  Score/20  Level              ID  Score/20  Level
1   12        Low Intermediate   17  13        Low Intermediate
2   10        Elementary         18  16        High Intermediate
3   9         Elementary         19  7         Elementary
4   6         Elementary         20  15        Low Intermediate
5   9         Elementary         21  14        Low Intermediate
6   7         Elementary         22  15        Low Intermediate
7   11        Low Intermediate   23  13        Low Intermediate
8   8         Elementary         24  13        Low Intermediate
9   11        Low Intermediate   25  20        Advanced
10  11        Low Intermediate   26  4         Elementary
11  13        Low Intermediate   27  14        Low Intermediate
12  13        Low Intermediate   28  11        Low Intermediate
13  14        Low Intermediate   29  18        Advanced
14  9         Elementary         30  11        Low Intermediate
15  7         Elementary         31  9         Elementary
16  13        Low Intermediate   32  18        Advanced

The difficulty level of each item in the test

The difficulty of each item was analyzed using MS Excel (IF and point-biserials.xlsx). Each item facility value has a range of 0.00 to 1.00; the higher the value, the easier the test item. Based on Djiwandono (1996), the indicators of item difficulty are: easy (0.7-1), moderate (0.3-0.7), and difficult (0-0.3). Table 2 presents the facility value of each item.

Table 2. The facility value of each item

Item No.  Item Facility Value  Interpretation
1         0.91                 Easy
2         0.91                 Easy
3         0.81                 Easy
4         0.28                 Difficult
5         0.69                 Moderate
6         0.03                 Too Difficult
7         0.78                 Easy
8         0.66                 Moderate
9         0.81                 Easy
10        0.47                 Moderate
11        0.88                 Easy
12        1.00                 Too Easy
13        0.63                 Moderate
14        0.41                 Moderate
15        0.88                 Easy
16        0.47                 Moderate
17        0.22                 Difficult
18        0.19                 Difficult
19        0.25                 Difficult
20        0.44                 Moderate

The data in Table 2 show that the item facility values vary. A total of 8 items (40%) are categorized as easy, 7 (35%) as moderate, and 5 (25%) as difficult. On this analysis, the distribution of item difficulty across easy, moderate, and difficult categories was balanced and appropriate for a proficiency test. However, one item was too difficult, with IF = 0.03 (item number 6), and one item was too easy, with IF = 1 (item number 12). These two items were therefore revised to make their facility values more appropriate.
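For illustration, the banding used in Table 2 can be expressed as a small helper. This is a sketch under our reading of the Djiwandono (1996) thresholds; the exact cut-offs for the "Too Easy" and "Too Difficult" flags at the extremes are not stated in the study, so those in the code are assumptions.

```python
# Illustrative sketch: classifying item facility (IF) values using the
# Djiwandono (1996) bands quoted above. The extreme labels mirror the
# "Too Easy" / "Too Difficult" flags in Table 2.
def classify_difficulty(if_value):
    if if_value >= 1.0:
        return "Too Easy"
    if if_value >= 0.7:
        return "Easy"
    if if_value >= 0.3:
        return "Moderate"
    if if_value > 0.05:   # cut-off for "Too Difficult" is our assumption
        return "Difficult"
    return "Too Difficult"

facility = [0.91, 0.28, 0.69, 0.03, 1.00]  # sample IF values from Table 2
for i, v in enumerate(facility, start=1):
    print(f"sample item {i}: IF = {v:.2f} -> {classify_difficulty(v)}")
```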
The discrimination index of each item in the test

To see how well an item can differentiate between higher and lower level learners, the discrimination index values can be examined. The higher the discrimination index value, the better the item distinguishes between higher and lower level test takers (Farhady, 2012). A discrimination index value of 0 indicates that low and high learners show the same performance, whereas a negative value indicates that lower-level students performed better than higher-level students. A D value of zero or below suggests that the item needs to be deleted or revised. The analysis of discrimination index values in this study was conducted using SPSS.

Table 3. The discrimination index value

Item No.  DI Value  Interpretation
1         .161      Enough
2         .161      Enough
3         .376      Good
4         .205      Enough
5         .292      Enough
6         .371      Good
7         .267      Enough
8         .372      Good
9         .210      Enough
10        .459      Very good
11        .414      Very good
12        0         No discrimination
13        .217      Enough
14        .603      Very good
15        .220      Enough
16        .440      Very good
17        .458      Very good
18        .492      Very good
19        .569      Very good
20        .621      Very good

Based on Table 3, most values indicate good items (N = 11, D > 0.3). A total of 8 test items are adequate (0.11-0.29), while one item should be revised because its discrimination index value is 0, i.e., no discrimination.

Item distractors

Another way to investigate item difficulty is by calculating "the proportion of test takers who chose the different distractors" (Bachman, 2004, p. 122). The performance of each distractor can be seen from the distractor analysis. Distractors that are never chosen are useless and need revision; conversely, a distractor that attracts a large number of test takers might not be clear and needs to be reviewed.

Table 4. Item distractors (percentage of test takers choosing each option)

Item  % A    % B    % C    % D
1     0      90.6   6.25   3.13
2     0      90.6   9.38   0
3     9.38   81.3   3.13   0
4     25     28.1   40.6   6.25
5     0      6.25   68.8   18.8
6     56.3   15.6   25     3.13
7     78.1   15.6   3.13   0
8     12.5   65.6   9.38   12.5
9     6.25   81.3   9.38   0
10    0      50     46.9   0
11    3.13   87.5   3.13   0
12    0      0      100    0
13    12.5   15.6   9.38   62.5
14    18.8   40.6   21.9   12.5
15    6.25   0      87.5   0
16    12.5   15.6   15.6   46.9
17    15.6   25     21.9   25
18    9.38   37.5   18.8   18.8
19    25     3.13   40.6   21.9
20    43.8   18.8   6.25   15.6

Table 4 indicates several patterns among the distractors. Firstly, 9 items have one or two distractors that were never chosen by test takers: items 1, 2, 3, 5, 7, 9, 10, 11, and 15. Secondly, one item (number 6) has a distractor that was chosen by many test takers (option A, 56.3%). Lastly, none of the distractors in item number 12 was ever selected by test takers, since the question item is arguably quite easy.
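To show how a table like Table 4 can be produced, the following sketch tallies the option chosen by each test taker. It is illustrative only: the answer strings are invented, and the study computed these proportions from its own response sheets.

```python
from collections import Counter

# Illustrative sketch: distractor analysis as the proportion of test takers
# choosing each option (cf. Bachman, 2004, p. 122). The answers below are
# hypothetical, not the study's data.
answers_per_item = {
    1: list("BBBABBCBDB"),   # ten hypothetical answers to item 1
    2: list("BBBBBBBCBB"),   # ten hypothetical answers to item 2
}

OPTIONS = "ABCD"

for item, answers in answers_per_item.items():
    counts = Counter(answers)
    n = len(answers)
    shares = {opt: 100 * counts.get(opt, 0) / n for opt in OPTIONS}
    unused = [opt for opt in OPTIONS if counts.get(opt, 0) == 0]
    row = "  ".join(f"{opt}: {shares[opt]:5.1f}%" for opt in OPTIONS)
    flag = f"  (never chosen: {', '.join(unused)})" if unused else ""
    print(f"item {item}: {row}{flag}")
```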
Revision of test items

Based on analysis of the question items, the Cronbach's alpha (α) of the test is 0.79, so it can be concluded that the test items are reliable. However, after analyzing the facility value and discrimination index of each item, we decided to revise two items: the most difficult item and the easiest one (the latter having no discrimination index value). The revised items are numbers 6 and 12. Item number 6 has a facility value of 0.03, indicating that it is very difficult; as a result, we revised only its multiple-choice options. Meanwhile, for item number 12, we revised the question element of the test item, since the item facility value shows it is too easy (IF = 1). The revisions made are as follows.

Revision 1 (multiple-choice options)

6. Mengapa Museum Bahari masih ditutup pasca kebakaran?
(Why was the Museum Bahari still closed after the fire?)
(A) karena hanya akan ada kegiatan bersih-bersih
(because there will only be clean-up activities)
(B) karena di bagian dalam masih terpasang garis polisi
(because the police line is still in place inside)
(C) karena akan ada investigasi lanjutan dari pihak kepolisian
(because there will be further investigation from the police)
(D) karena untuk kepentingan penyelidikan dan pengamanan dari warga
(for the purposes of the investigation and security from people)
(The correct answer is D, but the most frequently chosen answer was A, 56.3%)

Item number 6 is considered a very difficult item (IF = 0.03). Moreover, the distractor analysis shows that most test takers chose option A, which is incorrect. After reviewing the item, we conclude that its difficulty is due to misidentification of the location in the paragraph where the correct answer lies. We therefore modified option A. Below is the revised option for item number 6.

(A) untuk mencegah warga masuk ke area kebakaran
(to prevent people from entering the area of the fire)

Revision 2 (whole item)

12. Kapan rencana Museum Bahari akan dibuka kembali?
(When is the Museum Bahari planned to be reopened?)
(A) 16 Januari mendatang
(B) 17 Januari mendatang
(C) 19 Januari mendatang
(D) 20 Januari mendatang

Item number 12 is considered a very easy item (IF = 1); it has no discrimination value, and none of its distractors was chosen. After reviewing the item, we found that this is because the answer is stated obviously in the text. We therefore rewrote the question entirely. Here is the revised item number 12.

12. Area mana saja yang akan dibuka pada Jumat 19 Januari mendatang?
(Which areas will be open on Friday 19 January?)
(A) gedung yang terbakar
(the burnt building only)
(B) gedung yang tidak terbakar
(the unburnt buildings only)
(C) semua area, khususnya gedung yang tidak terbakar
(all areas, especially the unburnt buildings)
(D) semua area, kecuali gedung yang terbakar
(all areas, except the burnt building)
(The correct answer is C)

Discussion

In terms of content validity, it can be concluded that the content of the test is valid and appropriate because it contains three different types of authentic Indonesian texts: exposition, news, and literature. In addition, the descriptive analysis indicates that students' scores are distributed across beginner, intermediate, and advanced levels. Although the students are in the same class, Indonesian 3, their background experience in learning Indonesian varies, from less than 1 year to 2-5 years and more than 6 years. The item facility analysis shows that the test items developed have a balanced difficulty across easy, moderate, and difficult items, although one item was very easy and one very difficult. In addition, we also found good discrimination index values among the test items we designed. However, one item did not appear in the SPSS output because it has a discrimination index value of 0.
Based on the analysis of item facility and discrimination index values, we revised two question items, numbers 6 and 12. The item that contributed to the students' failure can be considered quite difficult for the test takers. It could be argued that question 6 assesses the test takers' ability to make an inference from the text, given that the answer is not explicitly mentioned in the text. However, among the distractors there is one option (option A) that has a close relationship to the question, so students may think that this distractor is the answer. Meanwhile, question 12 contains an obvious answer which requires test takers to choose the date of an event. This is very easy, since the answer option mentioning the event date clearly matches what is written in the text.

Furthermore, the findings in this study also have implications for the discussion of the format of reading tests. The study refutes the argument of Pyrczak (1975) that the use of multiple-choice questions leads to a lack of dependence on reading the passage on the part of test takers. In fact, in this test material, students need to read the passage to find answers that match the information and context. Moreover, the study supports the findings of Baghaei and Ravand (2015), which suggest that multiple-choice formats can trigger the cognitive processes and comprehension learners need to choose the most appropriate answer. Regarding its characteristics, the reading test designed here is also in line with the test development theory of Hughes (2003) and Brown (2004), which describes standardized stages. In addition, this study supports the findings of Boyaci and Guner (2018), which state that the use of authentic materials has an impact on students' reading comprehension and draws positive responses from students; this test provided authentic material by way of three different types of text. With its varying levels of difficulty, this test also supports the argument of Wilson (2016) that reading texts should be challenging so that students can feel a sense of achievement in answering the test. Regarding BIPA teachers, this test also complements the finding of Kamgar and Jadidi (2016) on the contribution that developing evaluation tests makes for foreign language teachers. To evaluate their students, teachers of Bahasa Indonesia can prepare question items that refer to the standards contained in the instruments developed in this study.

Conclusion

The development of tests to measure proficiency in Bahasa Indonesia for NNS is necessary. This research focuses only on the aspect of reading comprehension in a multiple-choice format, because this type of test is commonly used for large-scale testing. In this study, several steps were undertaken: designing the test, piloting, administering the test, analyzing the test items, and revising based on the results of the analysis. This test can be used by institutions to measure the reading proficiency of NNS of Bahasa Indonesia. In addition, the test could also be used by teachers and universities to determine student placement in university classes. Development of reading tests with a greater number of questions, for example 40 items, would also appear warranted.
In this study, a set of test items for Indonesian language proficiency was developed through a piloting process. The content validation shows that the questions are valid and reliable. However, no statistical procedures were undertaken to measure the quality of the items during the piloting process; as a result, we found that some items needed to be revised after we administered the test. Based on calculations of facility value, one question contributed to the test takers' failure, and one question was considered very easy, with all test takers answering it correctly. Thus, to create a more reliable test, it is recommended that during piloting the items be evaluated not only by seeking feedback from experts but also by undertaking statistical measurement.

References

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Baghaei, P., & Ravand, H. (2015). A cognitive processing model of reading comprehension in English as a foreign language using the linear logistic test model. Learning and Individual Differences, 43, 100-105.
Belet Boyaci, S. D., & Güner, M. (2018). The impact of authentic material use on development of the reading comprehension, writing skills and motivation in language course. International Journal of Instruction, 11(2), 351-368.
Bernhardt, E. (1983). Testing foreign language reading comprehension: The immediate recall protocol. Die Unterrichtspraxis / Teaching German, 16(1), 27-33.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.
Brown, J. D. (2004). Language assessment: Principles and classroom practices. White Plains: Pearson Education, Inc.
Cranney, A. G. (1972). The construction of two types of cloze reading tests for college students. Journal of Reading Behavior, 5(1), 60-64.
Farhady, H. (2012). Principles of language assessment. New York: Longman Inc.
Gorjian, B. (2013). The effect of passage content on multiple-choice reading comprehension test. Procedia - Social and Behavioral Sciences, 84, 160-164.
Gorsuch, G., & Taguchi, E. (2008). Repeated reading for developing reading fluency and reading comprehension: The case of EFL learners in Vietnam. System, 36(2), 253-278.
Gorsuch, G., & Taguchi, E. (2010). Developing reading fluency and comprehension using repeated reading: Evidence from longitudinal student reports. Language Teaching Research, 14(1), 27-59.
Heaton, J. B. (1988). Writing English language tests. New York: Longman Inc.
Hughes, A. (2003). Testing for language teachers. Cambridge: Cambridge University Press.
Jones, R. L. (1977). Testing: A vital connection. In The language connection: From the classroom to the world. ACTFL Foreign Language Education Series, 9.
Kamgar, N., & Jadidi, E. (2016). Exploring the relationship of Iranian EFL learners' critical thinking and self-regulation with their reading comprehension ability. Procedia - Social and Behavioral Sciences, 232, 776-783.
Keenan, J. M., Betjemann, R. S., & Olson, R. K. (2008). Reading comprehension tests vary in the skills they assess: Differential dependence on decoding and oral comprehension. Scientific Studies of Reading, 12(3), 281-300.
Nindyaningrum, F. W. (2018). Pengembangan instrumen asesmen uji kemahiran membaca bagi penutur asing. Master's thesis, Universitas Malang.
Pyrczak, F. (1975). Passage-dependence of reading comprehension questions: Examples. Journal of Reading, 18(4), 308-311.
Rahmiati, I. I., & Emaliana, I. (2017). Developing reading test using lower to higher order of thinking for ESP students. Language in India, 17(11), 124-144.
Saifudin, M. F., Suwandi, S., & Setiawan, B. (2014). Pengembangan model tes kompetensi berbahasa Indonesia. Thesis, Universitas Muhammadiyah Surakarta.
Sellers, V. D. (2000). Anxiety and reading comprehension in Spanish as a foreign language. Foreign Language Annals, 33(5), 512-520.
Shohamy, E. G. (1981). The cloze procedure and its applicability for testing Hebrew as a foreign language. Stanford University / University of California, Berkeley, 101-114.
Taguchi, E., Gorsuch, G., Takayasu-Maass, M., & Snipp, K. (2012). Assisted repeated reading with an advanced-level Japanese EFL reader: A longitudinal diary study. Reading in a Foreign Language, 24(1), 30-55.
Taguchi, E., Takayasu-Maass, M., & Gorsuch, G. J. (2004). Developing reading fluency in EFL: How assisted repeated reading and extensive reading affect fluency development. Reading in a Foreign Language, 16(2), 70-96.
Weir, C. J. (1990). Communicative language testing. New York: Prentice Hall.
Wilson, K. (2016). Critical reading, critical thinking: Delicate scaffolding in English for academic purposes (EAP). Thinking Skills and Creativity, 22, 256-265.

Biographical notes

ANDIKA EKO PRASETIYO is a fulltime student in the Master of Applied Linguistics program at the University of Melbourne. He holds a Bachelor of Education with a concentration in Indonesian Language and Literature Education from Universitas Negeri Semarang, Indonesia. His research interests include Indonesian education, language testing, and educational technology. Email: andikaekop@gmail.com