JUDUL DALAM BAHASA INDONESIA, DITULIS DENGAN HURUF TNR-14 BOLD, MAKSIMAL 14 KATA, RATA KIRI Research and Evaluation in Education Journal e-ISSN: 2460-6995 Research and Evaluation in Education Journal Volume 1, Number 1, June 2015 (100-113) Available online at: http://journal.uny.ac.id/index.php/reid MODIFIED ROBUST Z METHOD FOR EQUATING AND DETECTING ITEM PARAMETER DRIFT 1) Rahmawati; 2) Djemari Mardapi 1) Center of Educational Assessment, Indonesia; 2) Yogyakarta State University, Indonesia 1) rahmapepuny2011@gmail.com; 2) djemarimardapi@gmail.com Abstract This study is aimed at: (1) revising the criterion used in Robust Z Method for detecting Item Parameter Drift (IPD), (2) identifying the strengths and weaknesses of the modified Robust Z Method, and (3) investigating the effect of IPD on examinees’ classification consistency using empirical data. This study used two types of data. The simulated data were in the form of responses of 20,000 students on 40 dichotomous items generated by simulating six variables including: (1) ability distribution, (2) differences of groups’ ability between groups, (3) type of drifting, (4) magnitude of drifting, (5) anchor test length, and (6) number of drifting items. The empirical data was 4,187,444 students’ response of UN SD/MI 2011 who administered 41 test forms of Indonesian language, mathematics, and science. Modified Robust Z method was used to detect IPD and the IRT true score equating method was used to analyze the classification consistency. The results of this study show that: (1) the criterion of 0.5 point raw score TCC difference leads to 100% consistency on passing classification, (2) the modified Robust Z is accurate to detect the b and ab- drifting when the minimal length of anchor test is 25%, (3) IPD occurring on empirical data affected the passing status of more than 2,000 students. Keywords: Robust Z Method, Item Parameter Drift, IRT True Score Equating mailto:rahmapepuny2011@gmail.com mailto:djemarimardapi@gmail.com Research and Evaluation in Education Journal Modified Robust Z method for equating... - 101 Rahmawati & Djemari Mardapi Introduction The use of multiple test forms which is considered as parallel is widely implemented recently. Multiple test forms are used due to the test security, and to prevent the examinees from cheating easily to others. The other reason of designing parallel test forms is minimalizing the chance of practicing the test. If the administration of the test can be taken twice or more by a particular examinee, then using similar test form would kame the item get exposed frequently, the examinee may recall and practice the items. Although the test is designed to be parallel, it is so hard to have the multiple test forms are perfectly parallel. Different item will have different level of difficulty, regardless similar resources of item’s specification. The difference level of items’ difficulties can raise unfair issues. The less difficulty test form will advantage the examinee who took the form, while examinee who took the more difficult item will get less score not caused by less ability. Thus, comparing the score between groups who took different test forms will lead to a bias result. Non Equivalent Anchor Test (NEAT) design is a way to design parallel test forms, so that the difference of difficulty levels also the difference of groups’ ability can be adjusted. The adjustment of differences is determined bu ancor items. Example of national test that using NEAT design is National Exam (NE) for elementary schools (ES) and Madrasah Ibtidaiyah/MI (Islamic- based elementary school) which is familiarly named as NE ES/MI. UN SD/MI items are constructed by provincial item writing team. All province used the same test specification and items’ indicators. Each province then has their own test which differ from one province to others. To maintain the function of the test as a national measurement tool, 25% of the items were removed and replaced by national anchor items. The national ancor items were place in the same order, and preserve exactly the same content, format, even layout. No changes on national anchor items were allowed. All provinces had to make sure similarity of the anchor items. The anchor items have a very important role. The accuracy of test form’s difficulty level and the accuracy of examinee’s ability estimation depend on the quality of anchor items. The score on anchor items defines the difference of groups’ ability. A group which gets higher score on anchor items is considered as having better ability. Based on the ancor items’ property, the difference of test form’ level of difficulty can be determined and used for scoring adjustment (Cook & Eignor, 1991). Regarding its importance, the anchor items’ parameter should satisfy the measurement invariance assumption. The assumption is that the parameter’s value may shift around the bound of sampling error. Instead of being stable, anchor items’ parameter are not uncommon shifting accross subsample, test administration, or location. These shifting conditions are known as item parameter drift (IPD) and may cause bias on ability estimation. Keller and Wells (2009, p. 6) investigated the impact of drifting anchor items(IPD) on the accuracy of examinees’ ability estimation. The study found that the difference of groups’ ability defines the magnitude of IPD’s impact. Even only one moderate drifting anchor item could give a bias ability estimation. Robust Z method (Hyunh & Meyer, 2010) is a method for detecting drifting items and for fitting linking constants A and B which will be used in scaling process. Robust Z method applies a simple algoritm, yet still presents linking constant that is close to linking constant of the Stocking Lord method. The weaknesses of Robust Z method are its over-sensitivity and the absence of clear cut off criteria (Arce & Lau, 2011). The Robust Z method often detects undrifting anchor items as drifting. The criteria which are used are based on the probability of occurance in a hypothetic distribution; flagging an item as statistically significant IPD does not always mean that the impact of drifting ancors is practically significant. Research and Evaluation in Education Journal 102 - Volume 1, Number 1, June 2015 Regarding the criteria problem, thus, modification of Robust Z method is necessary. The modification which is made is aimed at practically detecting meaningful IPD. Only anchor items which caused significant practical impact will be excluded from scaling process. The modification can give consideration to make decision for either retaining or refining the anchor items. An example of practically meaningful impact is changes on examinees’ classification decision; passing to failing or failing to passing. This study is aimed at: (1) revising the criterion which is employed in Robust Z Method so that the detection of item parameter drift (IPD) can be related to a practically meaningful criterion, (2) identifying the strengths and weaknesses of the modified Robust Z Method in various conditions, and (3) investigating the effect of IPD on the examinees’ classification consistency in real life situation by implementing the modified Robust Z method on empirical data. Research Method Type of Research The research is categorized as a descriptive study. The study described the strengths of modified Ribust Z method, compared to the original version. The study also described the weaknesses of the modified Robust Z method and identified the test’s characteristics which were potential for having ‘practically meaningful’ IPD. The descriptions of IPD’s impact on examinees’ classification in real life situation were also revealed. The real life situation was illustrated by analyzing empiric data using the modified Robust Z method. Time and Location The research took place at Yogyakarta State University, Indonesia, the center of educational assessment, and a province that held item writing workshop for constructing NE ES/MI in the academic year of 2013. The research was conducted in 11 months, starting from March 2013 until February 2014. Population and Sample The population of this study was all students who were enrolled as examinee of NE ES/MI in the academic year of 2011 who took the main tests among all provinces in Indonesia. The main tests are defined as the tests which are administered on the main schedule of NE ES/MI. Students who took repeated session or make up session were excluded from the population. According to the population definition, the total number of the students in the research is 4,187,444. Sample selection in this study was based on the result of cheating validation process. A school is considered as a cheating school if at least one item were identified as being responded identically incorrect by at least 90% students in the school. Identification of cheating school resulted exclusion of all students’ responses of the identified school from the database. This cheating validation process eliminated about 40% of responses and the number of responses remained in the database were: 2,509,646 for bahasa Indonesia test, 2,509,517 for mathematics test, and 2,509,751 for science test. Technical Steps on Modifying Robust Z Method In order to improve the criteria of Robust Z method, the principle of the Difference that Matter (DTM) which is proposed by Brennan (2008, p. 108) at a topic of ‘population invariance’ was used. A way of considering an item as a drifting item is not only a statistical significance but an impact which is caused by the drifting items. How significant the impact is is determined by the researchers. The researchers set the practical impact which was considered as meaningful. In this study, the practical impact which was used to determine wether a drifting item was meaningful or not was the changes on classification consistency. If the detected drifting items made the score test equating changes significantly so causes any examinee classify differently, then the items considered as a practically meaningful IPD. It is suggested to exclude the practically meaningful IPD from scaling process, Research and Evaluation in Education Journal Modified Robust Z method for equating... - 103 Rahmawati & Djemari Mardapi otherwise the decision of examinee classification may disadvantage both the examinee and the user. The Robust Z method consists of several algoritms which, in the end, give the linking constant of A and B. These constants were then used in the scaling process to transform the scale of anchor and non anchor items’ parameter from a focal test form into the same scale as the reference test form. The transformation of the items’ parameter were used to plot the Test Characteristic Curve (TCC). The linking of point to point between TCC focal test and TCC transformed focal test became the conversion table for equating test score. The equated test score was then used to decide wheter an examinee passes or fails in the test. In order to evaluate the IPD impact on modified Robust Z method, Wyse and Reckase (2011) formula was adapted. The formula was used to see the significant difference between TCC total and TCC refinement. TCC total is TCC of transformed focal test that used all anchor items for scaling process. TCC refinement is TCC of transformed focal test that using only non drifting anchor items for scaling process. If the difference between the two TCCs is small, then the impact of IPD on classification consistency can be waived. On the other hand, when the difference is big, then the IPD is practically meaningful and suggested to be excluded from the scaling process. In tis study, the cut off value of 0.5 point ‘raw score’ was used as the maximum difference between the two TCCs. This cut off ensured a hundred percent of classification consistency. Equation (1), (2), (3), and (4) are the formulas which were used in modifying Robust Z method’s citeria. )()( 11     n iTOTAL iTOTAL n iCV iCV PPMax (1)  )(7,1exp1 1 )( ** CVyCVY iCV ba P     (2)  )(7,1exp1 1 )( ** TOTALyTOTALY iTOTAL ba P     (3) aY*CV=aY/ACV dan bY*CV=ACV*bY+BCV (4a) aY*TOT=aY/ATOT dan bY*TOT=ATOT*bY+BTOT (4b) Equation 4a and 4b are formulas which were used to calculate the linking constant of A and B in two different conditions: without refining IPD items(Atot and Btot) and by refining IPD items(Acv and Bcv). Both A and B linking constants were used to scale both anchor items and non anchor items’ parameter. The two kinds of A and B linking constants also lead to two kinds of TCC plots: TCC without refinement (ΣPitotal) and TCC by refining IPD (ΣPicv). The maximum absolute value of the difference between two TCCs was then compared to the DTM cut off value to find out the summary of practically meaningful IPD. Data, Instrument, and Data Collection Empirical data which were used in this research were collected by documentation process. The NE ES/MI of the year of 2011 data were copied from Center for Educational Assessment database. This concludes that the type of the data which was used was secondary data. The collected data were raw responses on the 41 test forms of bahasa Indonesia test, 41 test forms of mathematics test, and 41 test forms of science test. The key of each test form was also collected to complement the raw responses data sets. The instruments which were used in this research was analysis software. There were 5 softwares which were used in this study, namely: WinGen, Bilog-MG, Winstep, R program, and Robust Z Modif. The software functions are: generating response data, validating responses, estimating item parameter, detecting IPD, constructing conversion table, and equating test scores. Research and Evaluation in Education Journal 104 - Volume 1, Number 1, June 2015 Data Analysis Figure 1. Curve of Proportion of Correct Responses on Mathematics Test Using 2.5 Million Responses of NE ES/MI 2011 Examinees The analysis was started by determining the Item Response Theory (IRT) model that would be used. To find out the most suitable model, curves of raw score againts the proportion of students within each group that respond correctly on particular items were manually plotted. Figure 1 is an example of anchor items curve for mathematics test. After deciding the IRT model which was used, simulation study data were generated using WinGen (Han, 2007) software. Each dataset generated was represented responses of 20,000 examinees on 40 dichotomus items. There are six manipulated variables: (1) The percentage of anchor items relative to total number of items (15%, 25%, and 40%); (2) the percentage of drifting items relative of total number of anchor items (15%, 30%, and 45%); (3) the magnitude of drifting. There are two kinds of drifting: the a-parameter drifting (no drifting, moderate drifting of 0.3, and large drifting of 0.7); the b-parameter drifting (no drifting, moderate drifting of 0.5, and large drifting of 0.8); (4) the direction of IPD (symmetrical two direction, one direction); (5) the ability distribution shape (normal and negatively skewed); and (6) comparison of the ability distribution between groups (similar ability distribution and different ability distribution). In total, there are 188 conditions. Each manipulated condition was replicated 50 times for both the reference and the focal groups which resulted analysis of 18,800 datasets. The percentage occurance of manipulated drifting items detected as an IPD named as power rate, the percentage occurance of non manipulated drifting items detected as an IPD named as type I error rate, and the percentage occurance of TCCs differences larger than the cut off value named as DTM rate. The expected results from this study are combination of a high power rate, a low type I. The analysis of empirical data was started with calibration of national anchor items using national responses. The parameter estimated from the national responses was then used as references for calibrating non anchor items in each province. The method which was used to calibrate provincial items is known as fixed item parameter calibration. The similarity of mean and standard deviation between non anchor test and anchor test was used to select the reference test form for equating process. After the reference test form was selected, equating score test of each provincial main test form can be conducted. For each provincial main test form, there are two equating processes: using all anchor items regardless the drifting and using only non drifting anchor items. Based on the two equating processes, each examinee will be classified two times. The classification consistency analysis categories examinees into four groups as follows: (1) passing and keep passing, (2) passing then failing, (3) failing then passing, and (4) failing and keep failing. For each group, the proportion of examinees relative to total number of examinees was calculated. Classification consistency is the sum of proportion of examinees at groups of ‘passing and keep passing’ and ‘failing and keep failing’. The analysis of empirical data also determined the frequency of each anchor item which was detected as an IPD accross 41 test forms. This frequency was named as IPD rate. The anchor item that has high IPD rate needs P ro p o rt io n o f a n sw e ri n g c o rr e ct ly soal 8 soal 10 soal 17 soal 19 soal 24 soal 31 soal 35 soal 36 soal 37 soal 40 Research and Evaluation in Education Journal Modified Robust Z method for equating... - 105 Rahmawati & Djemari Mardapi detail analysis on source of drifting. The expected results from the empirical data are a high percentage of classification consistency and a low IPD rate. Findings and Discussion Results of Simulation Study The result of analysis power rate based on the type of ability distribution is presented in figure 2. The pattern of power rate of normal distribution is similar with the pattern of skewed distribution. Accross different level of drifting magnitude, the type of ability distribution does not present different results. It indicates that the performance of modified Robust Z method is similar with the two types of ability distribution. Figure 2. Power Rate Graph of Type of Ability Distribution accross Different Level of IPD’s Magnitude The modified Robust Z method is accurate when the ability of examinees in one group differs from the other group. Figure 3 and figure 4 are graphs of power rate and type 1 error rate IPD detection on interaction between number of anchor condition and difference of ability among group condition. Figure 3 shows that the modified Robust Z method is accurate when the number of anchor items is 40% and the groups are different in ability. A 100% of power rate means that the modified Robust Z method can detect manipulated drifting items accross all replications. A type 1 error rate close to 0% means that the occurance of detecting IPD incorrectly is almost close to zero. Figure 3. Power Rate Graph of Interaction between Type of Distribution and Ability Differences among Groups, Accross Number of Anchor Items and Type of IPD Figure 4. Type 1 Error Rate Graph of Interaction between Type of Distribution and Ability Differences Among Groups, Accross Number of Anchor Items and Type of IPD The results presented in figure 5 shows that using 40% anchor items can mimimalize the impact of IPD on the classification consistency. The DTM rate for condition of 40% anchor items is close to 0%, not only for the type a-drift but also tyoe b-drift, for both moderate and large level of drifting magnitude. It concludes that designing multiple test forms using 40% of anchor items anticipates the impact of IPD that may arise. Although the anchor test may have an IPD, at least the impact of the IPD to classification consistency can be minimalized. Power Rate normal skewed Power Rate SamaNormal SamaSkewed BedaNormal BedaSkewed Type I Error Rate SamaNormal SamaSkewed BedaNormal BedaSkewed Research and Evaluation in Education Journal 106 - Volume 1, Number 1, June 2015 Figure 5. DTM Rate Graph of Interaction between Type of Distribution and Ability Differences Among Groups, Accross Number of Anchor Items and Type of IPD Table 1. Power rate, Type I error rate, and DTM Rate Based on Anchor Test Length, Number of Drifting Items, and IPD Direction Anchor Test Length Number of Drifting Power rate 15% 10% (one way) 100.0 25% (symmetric) 100.0 25% (one way) 98.3 40% (one way) 17.1 25% 10% (one way) 91.4 25% (one way) 99.4 40% (symmetric) 95.5 40% (one way) 38.6 40% 10% (symmetric) 100.0 10% (one way) 100.0 25% (symmetric) 90.0 25% (one way) 97.1 40% (symmetric) 100.0 40% (one way) 9.5 The IPD detection rate accross different proportion of drifting items shows the weakness of modified Robust Z method as presented in Table 1. Table 1 shows that the power rate of modified Robust Z method is less than 20% in condition number of drifting items is 40% out of total number of anchor items. This finding summarizes that modified Robust Z method is not powerful to detect IPD when the proportion of drifting items in anchor test is big. Large proportion of drifting items makes the anchor items be distributed evenly around the fitting regression line, hiding the facts that many items were drifting. Overall, everything seemed normal and no outlier in the distribution. The modified Robust Z method failed to identify which anchors are drifting and which anchors are not. Table 1 shows that the modified Robust Z method is still accurate in detecting many drifting items as long as the direction of drfiting is symmetric. A symmetric direction means that some items are drifting more difficult, while some others are drifting less difficult. It is shown that when the drifting items number is 40% of anchor test length, the power rate of one way direction is 9.5%, while the power rate of symmetric direction increases dramatically into 100. Figure 6, figure 7, and figure 8 illustrate power rate, type 1 error rate, and DTM rate when direction of IPD distributions are one way and symmetrically two opposite direction. The results show that the modified Robust Z method perfoms better in looking the impact of IPD in test level not only in item level particularly. The practical impact of consistency classification is identified by modified Robust Z method as aggregate of items in test level. Even the number of drifting items were great, but when drifting in an opposite direction, the effect will cancel out and the practical impact can be waived. Figure 6. Power Rate Graph of Interaction Anchor Test length Condition, Number of IPD Condition, Ability Distribution, and IPD Direction. The simulation study results show that the modified Robust Z method improves the DTM Rate SamaNormal SamaSkewed BedaNormal BedaSkewed Power Rate 15%25% 25%40% 40%10% 40%25% 40%40% Research and Evaluation in Education Journal Modified Robust Z method for equating... - 107 Rahmawati & Djemari Mardapi performance of original Robust Z method specifically on test level. The original version cannot give conclusion on the impact caused by some detected drifting items as a part of a test. The original Robust Z method only justifies whether an item is drifting or not. The modified version adds information about the impact of all drifting items on the test score equating. This is similar to complement the analysis of differential item functioning (DIF) with differential test functioning (DTF) analysis. Many DIF items at the end can be waived if the DTF analysis performs no difference. Figure 7. Type 1 Error Rate Graph of Interaction Anchor Test length Condition, Number of IPD Condition, Ability Distribution, and IPD Direction. Figure 8. DTM Rate Graph of Interaction Anchor Test length Condition, Number of IPD Condition, Ability Distribution, and IPD Direction. Results of Empirical Study Table 2, table 3, and table 4 present the item parameter estimation for bahasa Indonesia test, mathematics test, and science test after they were calibrated using the national data. Each table represents the parameter of one test data which were conducted. Table 2. Anchor Items Parameter for Bahasa Indonesia Test Item Code Location Parameter Slope Parameter BIN 18 BIN 20 BIN 21 BIN 22 BIN 23 BIN 25 BIN 27 BIN 31 BIN 32 BIN 35 BIN 36 BIN 37 BIN 40 -2.442 -2.035 -2.752 -2.173 -1.917 -2.640 -2.756 -1.796 -2.438 -1.999 -0.997 -2.186 -2.776 1.791 0.829 0.898 1.147 1.372 1.307 1.309 0.705 1.913 1.304 0.811 0.942 1.164 Table 3. Anchor Items Parameter for Matematics Test Item Code Location Parameter Slope Parameter MAT8 MAT10 MAT17 MAT19 MAT24 MAT31 MAT35 MAT36 MAT37 MAT40 -0.722 -1.622 -1.328 -0.939 -0.888 -0.632 1.479 -0.451 -1.552 -7.492 1.261 0.635 1.350 1.145 1.227 1.283 0.328 0.856 1.144 0.093 Table 4. Anchor Items Parameter for Science Test Item Code Location Parameter Slope Parameter IPA2 IPA3 IPA9 IPA10 IPA18 IPA23 IPA27 IPA29 IPA32 IPA38 -2.869 -2.611 0.027 -1.506 -1.475 -0.536 1.204 -2.225 -0.889 -2.348 0.974 1.102 0.466 0.768 1.337 0.519 0.257 0.843 0.951 1.132 The parameter of anchor test was used to calibrate non anchor items (fixed item Type I Error Rate 15%25% 25%40% 40%10% 40%25% 40%40% DTM Rate 15%25% 25%40% 40%10% 40%25% 40%40% Research and Evaluation in Education Journal 108 - Volume 1, Number 1, June 2015 parameter calibration) to select the best reference for the test form. IPD detection was implemented using modified Robust Z method in over 41 test forms for each subject. Table 5 presents the IPD rate for each anchor item. Table 5. IPD Rate of Each Anchor Items over 41 Test Forms ID % IPD ID % IPD ID % IPD Bin 18 3 Mat 8 18 Ipa 2 45 Bin 20 8 Mat 10 90 Ipa 3 15 Bin 21 35 Mat 17 10 Ipa 9 93 Bin 22 13 Mat 19 13 Ipa 10 85 Bin 23 15 Mat 24 8 Ipa 18 0 Bin 25 48 Mat 31 28 Ipa 23 5 Bin 27 8 Mat 35 53 Ipa 27 46 Bin 31 58 Mat 36 20 Ipa 29 18 Bin 32 0 Mat 37 15 Ipa 32 3 Bin 35 68 Mat 40 100 Ipa 38 18 Bin 36 60 Bin 37 20 Bin 40 3 DTM 73 DTM 95 DTM 93 The results show that in bahasa Indonesia test, there is one anchor item which was detected as IPD, more than 60% anchor items which was detected in more than 85%, while science test has 2 anchor items detected as IPD in more than 85% provinces. The simulation study prooved that the modified Robust Z method has an accurate IPD detection. Then, the result of 85% IPD rate in empirical data means the item is truely drifting items. The anchor items which were detected as drifting items were then taken into consideration while performing scaling process. The drifting items impact determined whether it is practically meaningful or not. Empirical data analysis considers the examinee as passing the test if the score of each subject is at least 4.00. Scoring process was conducted twice: in refinement condition and without refinement condition. For each subject, the examinee will have two passing statuses. Table 6, Table 7, and Table 8 present the examinee status proportion based on the scoring processes. Tabel 6 for bahasa Indonesia subject, Table 7 for mathematics subject, and Table 8 for science subject. Table 6 summarizes analysis results of bahasa Indonesia test’s passing status. Eleven out of 41 test forms used show that IPD does not make the difference of TCCs bigger than DTM criteria’s cut off value. Careful examination on the eleven test forms proved that when the difference is less than DTM cut off value, the classification consistency is 100%. No examinee changes the passing status over two scaling conditions. It concludes that cut off criteria of 0.5 point raw score guarantee 100% classification consistency. Table 7 shows that only one drifting item with large magnitude such as mat 40 has a large impact on classification consistency. The DTM rate for mathematics test is very close to 100%. The number of inconsistent classification at the national level is also very huge, about 25.58 %. This number is equal to 621,600 students regarding the numerous students for Indonesia population. This is a very huge number and significant result. These 621,600 students represent student population in East part of Indonesia. The smallest percentage of inconsistent classification which is persented in Table 8 is 0.05%. This persentage seems small, but considering Indonesian huge population, this percentage is equal to 2050 students that enrolled in NE ES/MI 2011. If those 2050 students are assumed to continue their study in Junior High School/ Madrasah Tsanawiyah (Islamic-based JHS) which has capacity of 100 student, it means that 20 JHS/MTs will have under-quality students to be JHS/MTs students and passed the test just because of the measurement error. Research and Evaluation in Education Journal Modified Robust Z method for equating... - 109 Rahmawati & Djemari Mardapi Table 6 . Percentage of Classification Consistensy of Passing Status Based on Bahasa Indonesia Test Test Form Number of students Pass/Pass Pass/Fail Fail/Pass Fail/Fail DTM BIN_01_P01 61,195 99.74 0 0 0.26 No BIN_01_P02 72,992 99.62 0.07 0 0.31 Yes BIN_02 398,178 99.51 0.09 0 0.40 Yes BIN_03_P01 116,156 99.84 0 0 0.16 No BIN_03_P2/3 289,772 99.88 0.03 0 0.09 Yes BIN_04_P01 43,573 99.95 0 0 0.05 No BIN_05_P01 448,289 99.61 0.08 0 0.31 Yes BIN_06_P01 45,432 97.54 0 0 2.46 No BIN_07_P01 80,311 98.08 0 0.41 1.50 Yes BIN_08_P01 69,929 99.79 0 0 0.21 Yes BIN_09_P01 78,401 99.57 0.09 0 0.34 Yes BIN_10_P01 16,133 98.76 0 0.23 1.01 Yes BIN_11_P01 71,833 99.03 0 0 0.97 Yes BIN_12_P01 98,296 99.44 0.14 0 0.42 Yes BIN_13_P01 66,601 98.41 0.36 0 1.23 Yes BIN_14_P01 25,581 99.19 0 0 0.81 No BIN_15_P01 55,073 99.33 0.16 0 0.51 Yes BIN_16_P01 55,640 99.47 0.11 0 0.42 Yes BIN_17_P01 5,058 45.06 50.04 0 4.90 Yes BIN_18_P01 11,002 98.85 0 0 1.15 Yes BIN_19_P01 19,011 98.81 0 0 1.19 No BIN_19_P02 12,944 99.79 0 0.06 0.15 Yes BIN_20_P01 6,461 98.64 0 0 1.36 Yes BIN_21_P01 5,692 97.86 1.35 0 0.79 Yes BIN_22_P01 25,854 99.97 0 0 0.03 Yes BIN_23_P01 45,069 98.35 0 0.32 1.33 Yes BIN_24_P01 12,966 93.95 0 0 6.05 No BIN_25_P01 16,592 92.19 0 0 7.81 Yes BIN_26_P01 15,232 99.65 0 0 0.35 Yes BIN_28_P01 17,629 99.93 0.02 0 0.05 Yes BIN_29_P01 10,941 99.12 0 0 0.88 No BIN_30_P01 127,834 99.21 0 0 0.79 No BIN_31_P01 21,547 99.94 0.01 0 0.05 Yes BIN_32_P01 10,401 97.66 0 0 2.34 No BIN_33_P01 8,488 94.62 0 0 5.38 No National 2,500,100 97.42 1.32 0.03 1.23 Research and Evaluation in Education Journal 110 - Volume 1, Number 1, June 2015 Table 7 . Percentage of Classification Consistensy of Passing Status Based on Mathematics Test Test Form Number of students Pass/Pass Pass/Fail Fail/Pass Fail/Fail DTM MAT_01_P01 61,194 21.78 68.96 0.00 9.26 YES MAT_01_P02 72,990 31.30 61.24 0.00 7.47 YES MAT_02 398,119 85.47 9.41 0.00 5.12 YES MAT_03_P01 116,119 92.60 0.00 1.62 5.78 YES MAT_03_P02 137,498 94.63 0.00 0.00 5.37 NO MAT_03_P03 152,318 89.60 2.44 0.00 7.96 YES MAT_04_P01 43,569 97.13 0.00 1.41 1.46 YES MAT_05_P01 448,303 92.48 0.00 1.40 6.12 YES MAT_06_P01 45,403 62.37 29.90 0.00 7.73 YES MAT_07_P01 80,314 66.51 23.06 0.00 10.43 YES MAT_09_P01 78,393 0.79 93.84 0.00 5.37 YES MAT_10_P01 16,133 68.62 21.38 0.00 10.00 YES MAT_11_P01 71,822 66.89 28.01 0.00 5.10 YES MAT_12_P01 98,266 60.83 27.88 0.00 11.29 YES MAT_13_P01 66,600 32.40 28.92 0.00 38.68 YES MAT_14_P01 25,581 62.54 31.11 0.00 6.36 YES MAT_15_P01 55,078 72.25 7.49 0.00 20.26 YES MAT_16_P01 55,636 69.45 14.52 0.00 16.02 YES MAT_17_P01 5,058 22.18 57.85 0.00 19.97 YES MAT_18_P01 11,004 69.43 21.21 0.00 9.36 YES MAT_19_P01 19,015 74.55 14.18 0.00 11.26 YES MAT_19_P02 12,949 88.96 9.17 0.00 1.87 YES MAT_19_P03 25,767 78.74 12.93 0.00 8.32 YES MAT_20_P01 6,456 69.05 20.12 0.00 10.83 YES MAT_21_P01 5,692 80.32 16.30 0.00 3.37 YES MAT_22_P01 25,854 97.51 1.54 0.00 0.96 YES MAT_23_P01 45,073 66.86 22.40 0.00 10.73 YES MAT_24_P01 12,966 4.90 26.02 0.00 69.08 YES MAT_25_P01 16,551 45.77 32.56 0.00 21.67 YES MAT_26_P01 15,231 73.29 13.35 0.00 13.35 YES MAT_27_P01 8,252 37.71 36.40 0.00 25.88 YES MAT_28_P01 17,629 14.95 78.37 0.00 6.68 YES MAT_29_P01 10,942 77.19 13.53 0.00 9.29 YES MAT_30_P01 127,831 82.37 14.18 0.00 3.46 YES MAT_31_P01 21,548 25.37 60.24 0.00 14.40 YES MAT_32_P01 10,407 62.27 22.30 0.00 15.43 YES MAT_33_P01 8,488 6.95 81.73 0.00 11.32 YES NATIONAL 2,430,049 62.20 25.58 0.12 12.10 Research and Evaluation in Education Journal Modified Robust Z method for equating... - 111 Rahmawati & Djemari Mardapi Table 8 . Percentage of Classification Consistensy of Passing Status Based on Science Test Test Form Number of students Pass/Pass Pass/Fail Fail/Pass Fail/Fail DTM IPA_01_P01 61,195 95.82 2.83 0.00 1.35 Yes IPA_01_P02 72,988 96.07 2.63 0.00 1.30 Yes IPA_02 398,196 92.70 5.98 0.23 1.09 Yes IPA_03_P01 116,123 99.30 0.00 0.00 0.70 No IPA_03_P02 137,504 99.65 0.12 0.00 0.22 Yes IPA_03_P03 152,321 99.46 0.00 0.00 0.54 Yes IPA_04_P01 43,570 99.86 0.05 0.00 0.10 Yes IPA_05_P01 448,309 98.70 0.00 0.32 0.97 Yes IPA_06_P01 45,444 97.33 0.00 0.71 1.96 Yes IPA_07_P01 80,309 96.16 0.00 1.13 2.71 Yes IPA_08_P01 69,932 99.11 0.30 0.00 0.59 Yes IPA_09_P01 78,421 98.48 0.00 0.77 0.75 Yes IPA_10_P01 16,133 98.76 0.00 0.00 1.24 Yes IPA_11_P01 71,848 99.13 0.00 0.00 0.87 Yes IPA_12_P01 98,279 98.00 0.00 0.00 2.00 Yes IPA_13_P01 66,598 93.94 0.00 0.00 6.06 No IPA_14_P01 25,580 98.32 0.00 0.00 1.68 Yes IPA_15_P01 55,080 94.96 0.00 1.49 3.55 Yes IPA_16_P01 55,639 96.74 0.00 0.00 3.26 Yes IPA_17_P01 5,058 35.11 59.15 0.00 5.73 Yes IPA_18_P01 11,003 96.95 0.79 0.00 2.26 Yes IPA_19_P01 19,015 98.85 0.00 0.00 1.15 Yes IPA_19_P02 12,948 99.86 0.05 0.00 0.08 Yes IPA_19_P03 25,739 98.41 0.50 0.00 1.09 Yes IPA_20_P01 6,457 98.44 0.53 0.00 1.04 Yes IPA_21_P01 5,692 98.84 0.67 0.00 0.49 Yes IPA_22_P01 25,855 99.82 0.00 0.00 0.18 Yes IPA_23_P01 45,073 97.60 0.00 0.71 1.70 Yes IPA_24_P01 12,966 93.38 1.61 0.00 5.01 Yes IPA_25_P01 16,595 90.32 0.00 0.00 9.68 Yes IPA_26_P01 15,232 99.38 0.00 0.00 0.62 Yes IPA_27_P01 8,253 92.88 3.55 0.00 3.57 Yes IPA_28_P01 17,628 98.79 0.96 0.00 0.25 Yes IPA_30_P01 127,839 99.65 0.00 0.00 0.35 Yes IPA_31_P01 21,547 94.10 5.43 0.00 0.48 Yes IPA_32_P01 10,408 97.53 1.40 0.00 1.07 Yes IPA_33_P01 8,494 94.28 0.00 0.00 5.72 No NATIONAL 2,489,271 95.59 2.34 0.14 1.93 A deep attention must be put to answer the results of classification consistency. The table shows that inconsistent classification is mostly in categories of passing, while in fact, the status is failing. This means that thousand even hundred thousands students are decided as passing the test, while in fact, their competencies are still below the standard. This inconsistency has a big influence because the NE score is then used as a selection tool for ebtering secondary schools. The starting point of learning process cannot be in the right starting point. The students Research and Evaluation in Education Journal 112 - Volume 1, Number 1, June 2015 need to repeat or remedy what their lack of for their primary school’s competencies before continuing to a higher level of competency. Summary and Suggestion Summary The analysis proved that external criteria of 0.5 point raw-score TCC difference for modifying Robust Z method can make the modified Robust Z method able to give information about the consequencies of IPD to classification consistency. If the difference of TCC is less than 0.5 point raw-score, then the classification of consistency will be 100%. The modified Robust Z method perfoms better on looking the impact of IPD in test level not only particularly in item level. The practical impact of consistency classification is identified by modified Robust Z method as aggregate of items in test level. Even the number of drifting items were great, but when drifting in an opposite direction, the effect will cancel out and the practical impact can be waived. The implementation of modified Robust Z method in empirical data shows that the impact of IPD was very significant for NE ES/MI 2011 examinees. At least 2000 students were classified as passing, while in fact, their competencies were not sufficient to pass the exam and continue to secondary education. Suggestion The use of multiple test forms is more frequent. Score test equating process has to be performed. The heterogenity of ability accross provinces in Indonesia is also potential for the occurrence of drifting items, which in the study has a big impact on the classification consistency. Regarding to those facts, then the modified Robust Z method is suggested to be used for both detecting drifting items and equating test score, especially when the design of the multiple test form employs set of ancor items, passing classification. The analysis shows that in order to minimalize the effect of IPD on classification consistency, it is suggested to have 40% of anchor test length. This proportion has quite big risk both from the security of anchor items from being too exposed and also less variance items accross provinces. The rule of thumb of anchor test length is 20% (Hambleton, Swaminathan, & Rogers, 1991). To have better prevention of drifting items yet still maintain the item exposure and variability accross provinces, the 40% anchor test length can be constructed in matrix sampling design. Split the anchor tests into several clusters. One cluster to others shares overlapped items. This study also has limitations. The condition simulated in this study is too few to represent all variance of conditions in a real life situation. Then it is suggested to extend this study using broader condition so that the strengths and weaknesses of modified Robust Z method can be comprehensively analyzed. This study also only estimates the impact of drifting items on classification consistenty, and there is no analysis performed to see the performance of modified Robust Z method on ability estimation accuracy or scaling equation accuracy. Thus, a study which employs similar method but focuses on the consequences of ability estimation accuracy is very suggested. The results of ability estimation accuracy or scaling constant accuacy will complement the rsults of this study. References Arce, A. J. & Lau, A. C. (2011). Statistical properties of 3PL Robust Z: An investigation with real and simulated data sets. Paper presented in the Annual Meeting of the National Council on Measurement in Education, in New Orleans, Lousiana. Brennan. (2008). A discussion of population invariance. Applied Psychological Measurement. Volume 32 (1), pp. 102- 114. Cook, L. L. & Eignor, D. R. (1991). IRT equating methods. Educational Research and Evaluation in Education Journal Modified Robust Z method for equating... - 113 Rahmawati & Djemari Mardapi Measurement: Issues and Practice, 10, pp. 37-45. Hambleton, R. K., Swaminathan. H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage. Han, K. (2007). WINGEN: Windows software that generates IRT parameter and item responses. Applied Psychological Measurement, 31, pp. 457–459. Huynh & Meyer. (2010). Use of Robust Z in detecting unstable items in item response theory models: Practical assessment. Research and Evaluation Electronic Journal, 15 (2). Keller & Wells. (2009). The effect of removing anchor items that exhibit differential item functioning on the scaling and classification of examinees. Paper presented in the annual meeting of NCME, in Denver. Wyse & Reckase. (2011). A graphical approach to evaluating equating using test characteristic curve. Applied Psychological Measurement, 35 (3), pp. 217-231.