Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple Choice Test According to the Item Response Theory (IRT)

Deyab Almaleki
Department of Evaluation, Measurement, and Research, Umm Al-Qura University, Makkah, Saudi Arabia
damaleki@uqu.edu.sa
Corresponding author: Deyab Almaleki

Abstract-The aim of the current study is to improve evaluation practices in the educational process. A multiple choice test was developed based on content analysis, with a test specification table covering part of the content of the applied statistics course. In its final form the test consisted of 18 items, which were reviewed by specialists in the field of statistics to determine their validity. The results describe the relationship between individual responses and student ability. Most item thresholds lie in the negative region of the ability scale. The item information curves show that the items provide a good amount of information about students of lower or moderate ability compared to students of high ability. In terms of precision, most items were better suited to lower-ability students. The test characteristic curve was plotted according to the change in the characteristics of the examinees. The test provided more information about female students than about male students, and more information about students who had not studied statistics at an earlier stage than about those who had. Based on the level of the statistics course, the test clearly indicated that there should be a periodic review of tests, in line with the nature and level of the course material, in order to reach a sound judgment about the students' progress and their level of ability.

Keywords-item response theory; item characteristics; multiple-choice; psychometric properties

I. INTRODUCTION
A test is an educational tool that is frequently used to evaluate students' academic achievement and progress. Tests also provide an opportunity to verify students' skills in many educational situations when it is not possible to use other assessment methods. Despite the known problems of indirect measurement, many traits, such as mathematical ability, verbal skills, resistance to stress, intelligence, dissatisfaction, or opinions about a particular topic, cannot be directly observed and measured [1-3]. These are known as latent traits, and they can be measured only indirectly, often using specially prepared questionnaires in which the responses are closely related to the specific traits being studied. Tests are frequently used to assess students' cognitive progress and to build question banks. As a result, the so-called latent trait models have been developed and are used to estimate the parameter values associated with the human personality [4-5]. These models provide a different type of information that in turn helps to develop and improve tests accordingly. Many researchers rely on the information from data analyzed from the subjects' responses, but it is important to ask whether the formulation of the stimuli (questions) may provide another type of data or information that serves the research process [1, 6-8].
Many researchers and specialists in the field of measurement and evaluation are interested in the basic concepts and organizational theoretical frameworks of measurement and evaluation and in the ways to apply them [9-11], due to the great role this science plays in various fields of scientific research in general and in educational and psychological research in particular. Research activity oriented to the needs of educational institutions will positively affect development and improvement in accordance with Saudi Arabia's Vision 2030. Furthermore, the means of developing tests and measurement methods are extremely important, because the data issued from the measurement processes have to be valid and accurate, as some crucial decisions, such as admission or promotion, may be based upon them [12-16]. In addition, it is the responsibility of specialists in the field of measurement and evaluation to enrich the literature, develop the tests used in the educational field, and reduce the possibility of measurement errors during the evaluation process [17, 18]. Postgraduate tests in Saudi universities have not yet been subjected to much scrutiny by local and international evaluation institutions, because most quality assurance agencies, such as the National Commission for Evaluation and Accreditation (NCAAA), have only recently included postgraduate programs in their plans, providing the opportunity for universities to apply for accreditation of these programs. Midterm and final exams and the way they are administered are some of the indicators used by the NCAAA or other agencies to accurately judge the progress of an academic program. Therefore, improving these tests has become a necessary requirement.

II. SIGNIFICANCE OF THE CURRENT STUDY
The significance of the current study stems from the importance of the evaluation processes in the educational process. The process of improving tests and identifying their psychometric characteristics is the task of those working in the field of measurement and evaluation, in order to provide a comprehensive understanding and a deep descriptive analysis of the advantages and disadvantages observed in those tests [19-21]. Furthermore, the scarcity of this type of scientific studies has widened the gap between the tools currently used to measure the level of achievement of students and what these tools ought to be. The quality of these tools has not been determined or reviewed, and therefore they have not been assessed or evaluated [22-25]. The practical significance of this type of research lies in the use of Item Response Theory (IRT) models to analyze students' responses to an achievement test in a more objective way, showing whether the multiplicity of the participants' characteristics affects the (multiple choice) test items in terms of the accuracy of the estimates of the item parameters and the individuals' ability parameters. It can also guide the composers of test questions to take into account some points that may affect the psychometric properties of the items and the test and the accuracy of its results [26-29].

III. ITEM RESPONSE THEORY
There are many ways (i.e. models) to determine the relationship between individual responses and student ability.
Within the framework of modern measurement theory, many models and applications have been formulated and applied to real test data. These include the measurement assumptions about the characteristics of the test item, the performance of the subject, and how this performance is related to knowledge [27, 37-40]. Tests and evaluation processes in general form the basis of the education system, and their importance lies in improving educational planning, developing a mechanism for enhancing curricular content, measuring learners' competence, and comparing student performance or achievement data. Evaluations also have a role for schools and teachers [30-33]. Tests are a tool of assessment, and their quality depends to a large extent on the nature and quality of the information collected during the preparation of the assessment. Over the decades, the test building system has undergone a lot of development through the emergence of many test building theories focused on many types of tests, such as oral tests, standardized tests, and realistic evaluation. Until today, theories have continued to develop in order to keep up with the changes in policies and new educational practices [2]. The modern theoretical methods were largely developed from the 1960s to the late 1980s. The IRT is a general statistical theory considering the characteristics of the test item, the subject's performance on the item, and how the performance is related to the abilities that are measured by the test items [12, 34-36]. The IRT provides a rich statistical tool for analyzing educational tests and psychometric measures. The IRT assumes the following:
• The test performance of the subjects can be predicted (or explained) by a set of factors called traits, latent traits, or abilities.
• The relationship between the subject's performance and the properties of the test item can be described through a monotonically increasing function called the item characteristic curve.
• The response to a test item can be either discrete or continuous, and it can be scored dichotomously or polytomously. Item score categories can be ordered or unordered, and there can be one or many abilities behind the test performance.

IV. CHARACTERISTICS OF THE IRT MODELS
• The IRT model should define the relationship between the observed response and the unobserved underlying construct (latent trait).
• The model should provide a method for estimating the scores of the latent trait.
• The subjects' scores will be the basis for the assessment of the underlying construct of the model.
• The IRT model assumes that the subject's performance can be predicted or explained by one or more latent traits.

In IRT it is often assumed that the examinee has some unobservable latent trait (also called latent ability), which cannot be studied directly. The purpose of the IRT is to propose models that allow linking these underlying traits to some of the characteristics that can be observed on the subject [41]. There are many models in the IRT and they have been classified into two types: models that use the cumulative normal (ogive) curve and logistic models. Logistic models are currently more widespread, they are suitable for dichotomously scored items, and they differ according to the number of estimated item parameters [6, 31, 42-45]. Three models are commonly used for binary data, which use 1 for a correct response and 0 for a wrong response: the one-, two-, and three-parameter logistic models, which are examined below.
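As a compact preview of the three models examined in the following subsections, the sketch below collects their item response functions. This is an illustrative sketch only: the function names are ours rather than part of any IRT package, and the symbols (theta for ability, b for difficulty, a for discrimination, c for guessing) mirror the parameters described next.

```python
import math

def p_1pl(theta: float, b: float) -> float:
    """Rasch / one-parameter logistic model: only the item difficulty b varies."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_2pl(theta: float, b: float, a: float) -> float:
    """Two-parameter logistic model: adds the item discrimination a (the slope)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_3pl(theta: float, b: float, a: float, c: float) -> float:
    """Three-parameter logistic model: adds a lower asymptote c for guessing."""
    return c + (1.0 - c) * p_2pl(theta, b, a)

if __name__ == "__main__":
    # When ability equals difficulty, the 1PL (Rasch) probability is exactly 0.5.
    print(p_1pl(0.0, 0.0))             # 0.5
    # With a guessing floor of 0.25 (four alternatives), the same point gives 0.625.
    print(p_3pl(0.0, 0.0, 1.0, 0.25))  # 0.625
```

The 1PL and 2PL curves are the special cases of the 3PL with c = 0 (and, for the 1PL, a fixed discrimination).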
A. One-Parameter Logistic Model
The concept of information plays an important role in IRT, as it can be used to evaluate how accurately an item included in the test measures the level of the latent trait θi. This latent trait could be, for example, the level of the student's knowledge, intelligence, ability, satisfaction, stress, etc. In educational tests, the item parameter represents the difficulty of the item, while the subject parameter represents the ability level of the people being evaluated. The greater the subject's ability in relation to the difficulty of the item (the parameter βj describes the degree of difficulty of the item and the level of influence of the item on the subject), the greater the probability of a correct response to that item. When the subject's position on the latent trait is equal to the difficulty of the item, then, according to Rasch's model, there is a 0.5 probability that the subject's response is correct. Accurate information about the value of θi depends on a number of factors, the most important of which are the properties of the questions (items) used to evaluate the parameter (the latent trait) [2, 30, 33, 46].

B. Two-Parameter Logistic Model
In the Two-Parameter Logistic (2PL) model, the situation is different from the one in the one-parameter model. The one-parameter model assumes that questions differ only with respect to item difficulty, whereas in the two-parameter logistic model two parameters are assumed to be connected to the test item: the parameter βj, which describes the difficulty of the item (question), and the additional parameter αj, which describes the discrimination of the item. The discrimination parameter (the slope of the curve) describes the degree to which the question helps to distinguish between subjects with the highest level of a trait and those with a lower level of the same trait. This parameter also shows the extent of the relevance of the item to the overall score of the test. The higher the value of that parameter, the greater the discrimination of the item (and the easier it is to separate subjects with a high level of the trait from those with a low level of the same trait). It should also be noted that the most difficult test item is not necessarily the test item with the highest potential to discriminate between the subjects [2, 19, 36, 47, 48].

C. Three-Parameter Logistic Model
The Three-Parameter Logistic (3PL) model is used in IRT, and it determines the probability of a correct response to a dichotomously scored multiple-choice item as a logistic distribution. The 3PL model is an extension of the 2PL model, as it introduces the guessing parameter. Items now differ in terms of discrimination, difficulty, and the probability of guessing the correct response [47]. The guessing parameter, denoted cj, is the lower asymptote of the item characteristic curve and represents the probability that subjects of low ability answer the item correctly. The parameter is included in the model to account for item response data from low-ability subjects, where guessing is a factor in test performance [48-50].
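In a common logistic form (the specific parameterization below is our assumption, since this section describes the equation only in words), the 3PL response function can be written as:

$$P_j(\theta_i) = c_j + (1 - c_j)\,\frac{e^{\alpha_j(\theta_i - \beta_j)}}{1 + e^{\alpha_j(\theta_i - \beta_j)}}$$

where θi is the examinee's proficiency, αj the discrimination, βj the difficulty, and cj the guessing parameter of item j. Setting cj = 0 recovers the 2PL model, and additionally fixing αj gives the 1PL model.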
The basic equation of the 3PL model gives the probability that a randomly selected examinee with a certain proficiency level θ on scale k will respond correctly to item j, where the item is characterized by its discrimination (αj), difficulty (βj), and guessing parameter (cj) [27, 35, 37, 38, 51].

V. MULTIPLE CHOICE TEST ANALYSIS
Understanding how to interpret and use the information based on students' test scores is just as important as knowing how to create a well-designed test. An essential part of building tests is using the feedback from a good test analysis. The most important statistical information provided by a good analysis of a multiple-choice test is the following:

A. Item Difficulty
The test item difficulty factor βj represents the proportion of the respondents who answered the item correctly. The difficulty factor ranges from 0.0 to 1.00, and the higher its value, the easier the test item. For example, when the value of the difficulty factor βj is higher than 0.90, the test item is described as very easy and should not be used again in subsequent tests, since almost all students are able to respond to it correctly. When the value of βj is less than 0.20, the test item is described as extremely difficult and should be reviewed before subsequent tests. The optimal test item difficulty factor is 0.50, as it ensures maximum discrimination between high- and low-ability respondents [52-54]. To maximize item discrimination, the desired difficulty levels are slightly higher than halfway between the probability of answering correctly by chance (1.00 divided by the number of alternatives for the item) and the ideal score for the item (1.00) [55-58]. The ideal degree of difficulty can therefore be calculated with the following rule:

Ideal difficulty = ((ideal score for the item - probability of a correct answer by chance) / 2) + probability of a correct answer by chance

For example, if the test item contains four answer alternatives, the probability of answering it correctly by chance is 0.25 (1.00/4 = 0.25), and the ideal degree of difficulty is (1.00 - 0.25)/2 + 0.25 ≈ 0.62 (Table I).

TABLE I. THE IDEAL DEGREE OF DIFFICULTY GUIDELINE
Design of the test item | Ideal degree of difficulty for the test item
Multiple choice (5 alternatives) | 0.60
Multiple choice (4 alternatives) | 0.62
Multiple choice (3 alternatives) | 0.66
Multiple choice (2 alternatives) | 0.75

B. Item Discrimination
The test item discrimination factor, referred to with the symbol αj, represents the point-biserial relationship between the respondents' performance on the item and their total scores. The discrimination factor value ranges from -1.00 to 1.00. A high value of the discrimination factor indicates that the test item is able to distinguish between respondents: those who scored high on the test were able to answer the item correctly, while those who obtained low test scores were not able to respond to the item correctly [54, 59]. Test items with values close to or less than zero should be removed. Moreover, further consideration should be given to any item that was answered better by those who generally performed poorly on the test than by those who performed better on the test as a whole, since such an item may be confusing in some way to the top-performing respondents [52, 53, 58, 59]. Table II summarizes the discrimination guideline, and a small computational sketch of these indices follows it.

TABLE II. THE IDEAL DEGREE OF DISCRIMINATION GUIDELINE
Discrimination factor value | Description of the test item
≥ 0.4 | A very good test item
0.3 - 0.39 | A good test item; possible improvements may be considered
0.2 - 0.29 | A fairly good test item; it is recommended to improve it
≤ 0.2 | A weak test item, with the recommendation of deleting it
≤ 0 | It is recommended to directly delete the item
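The sketch below shows one way to compute the two indices above, and the ideal-difficulty rule, from a scored 0/1 response matrix. It is a minimal illustration under our own naming and with simulated data, not the author's procedure; the discrimination index is computed here as the point-biserial correlation between each item and the total score.

```python
import numpy as np

def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Proportion of correct (1) answers per item; rows are examinees, columns are items."""
    return responses.mean(axis=0)

def item_discrimination(responses: np.ndarray) -> np.ndarray:
    """Point-biserial correlation of each item with the total test score."""
    total = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], total)[0, 1]
                     for j in range(responses.shape[1])])

def ideal_difficulty(n_alternatives: int) -> float:
    """Ideal difficulty: halfway between the chance level and the perfect score of 1.0."""
    chance = 1.0 / n_alternatives
    return (1.0 - chance) / 2.0 + chance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = (rng.random((338, 18)) < 0.75).astype(int)  # fake 0/1 data: 338 examinees x 18 items
    print(item_difficulty(data)[:3])
    print(item_discrimination(data)[:3])
    print(ideal_difficulty(3))                          # 0.666..., cf. the 0.66 entry in Table I
```

With real data, the difficulty values correspond to the item means reported later in Tables VI to X.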
VI. PSYCHOMETRIC PROPERTIES OF THE TEST
Psychometric properties are a statistical mechanism to verify the fairness, objectivity, and relevance of the test with respect to the phenomenon to be measured. The most important psychometric properties are the following:
• The individual data series for all test components.
• The characteristics of the data collected for all test components.

The fairness of the test lies in its freedom from any bias and in its suitability for the target group, regardless of gender, race, or religion. The psychometric properties of the test are examined to verify that it is objectively constructed and free of any bias. Studying the rules for formulating multiple-choice test items is important because they have an impact on the level of performance on the items or on the test as a whole. This means that the sound construction of the test and the verification of all its psychometric properties ensure that the test avoids any violations in the structure of the items, which would in turn affect the individual's performance on the test items [36, 60-62].

VII. METHODS
The descriptive survey method was used to obtain data from a real-life scenario of giving postgraduate-level midterm and final exams to assess master's students' achievement level in the subject of applied statistics. This analytical study aimed to determine the quality of the test, its efficiency, and the reliability of its results despite the varying circumstances in which it is given.

A. Measurements
A Criterion Referenced Test (CRT) was used to evaluate students' achievement in the applied statistics course at the master's degree level, in order to verify the quality of the test as a tool to evaluate the level of students' achievement. The test was developed based on content analysis, and the test specification table covered part of the content of the applied statistics course. In its first form the test consisted of 25 items, which were reviewed by specialists in the field of statistics to determine their face validity, and 7 items were omitted as a result. So, in its final form, the test had 18 items, which were applied to the study sample to verify their quality (Table III). The results of the thorough analysis of the test items were handed over to the central question bank, in order to compare the performance of the test items and verify their lifespan when statistical operations are performed on them again later. To verify the test reliability, the Kuder-Richardson (KR-20) method was used, because the data are binary (coded 0 and 1 after scoring the items) and because the test items differ in their difficulty. The results indicated that the test has a high reliability coefficient, KR-20 = 0.842; a computational sketch of this coefficient follows.
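As a minimal sketch (with our own variable names and simulated data, not the study's data), KR-20 can be computed from a 0/1 response matrix as KR-20 = (k/(k-1)) * (1 - Σ p·q / variance of total scores):

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items."""
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Simulated data only: 338 examinees x 18 items with varying item difficulty.
    ability = rng.normal(size=(338, 1))
    difficulty = rng.normal(scale=0.8, size=(1, 18))
    data = (rng.random((338, 18)) < 1.0 / (1.0 + np.exp(-(ability - difficulty)))).astype(int)
    print(round(kr20(data), 3))
```

The reported coefficient of 0.842 comes from the authors' real response data, not from this simulation.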
B. Sample
The current study population consisted of all students of the applied statistics course at the master's level at Umm Al-Qura University, on the main campus in Makkah and in all branches of the University. The population was rather large, estimated at about 400 male and female students registered during the second semester of the academic year 2020. It was difficult to reach all its members because of the financial cost, time, and effort required. Moreover, the educational and population environment conditions are very similar for all students, and previous studies using samples of students from the Saudi universities' population did not show any clear bias. Therefore, the current study used a random sample consisting of 338 students, equivalent to 84.5% of the study population, studying different disciplines in the College of Education.

TABLE III. TEST ITEMS
1. Which of the following is true for the interval scale level?
- Classification of individuals
- Ranking of individuals and identifying differences
- Both
2. If the value of the correlation coefficient is equal to (-0.8), this is an indication that the relationship is
- Weak
- Nonexistent
- Strong
3. Estimates that are calculated by studying the sample members are called
- Variables
- Parameters
- Statistics
4. When the population is homogeneous and its number is very large, we use the following type of sampling
- Simple Random Sampling
- Stratified Random Sampling
- Cluster Random Sampling
5. When studying the relationship between job performance and job satisfaction, job satisfaction is called
- Dependent variable
- Independent variable
- Intruding variable
6. Which of the following statements is true for the relationship between the sample and the population?
- Population parameters are a good estimate of sample statistics
- Population parameters are a good estimate of raw scores
- Sample statistics are a good estimate of population parameters
7. If the relationship between chronological age and academic achievement is (r = 0.91), then this is evidence of
- The greater the age, the greater the achievement
- The younger the age, the lesser the achievement
- The one or the other
8. If the sum of the squares of the deviations from the mean = 80 and the number of students = 21, then the standard deviation equals
- 2
- 4
- 6
9. The range of the relationship that exists between two quantitative variables is called
- Slope
- Connection
- Range
10. The mode for the values (1, 4, 9, 12) is
- Zero
- 12
- No mode
11. The median of the values (9, 15, 7, 10, 12) is
- 7
- 9
- 10
12. In one of the regions of KSA, a study was conducted on the pros and cons of the e-learning system at the undergraduate level. In this study, the academic level is a
- Variable
- Constant
- Other
13. The measure of central tendency not affected by outliers is
- The median
- The mode
- The one or the other
14. The number of training courses during a whole semester is a
- Continuous quantitative variable
- Discrete quantitative variable
- Descriptive variable
15. Achievement tests depend on the following level of measurement scale
- Ordinal
- Interval
- Ratio
17. When the sample size is increased
- It increases the probability of a normal distribution
- It makes the sample not comply with the normal distribution
- There is no relation between them
18. Setting the confidence level at 95% is a prerequisite for educational sciences.
- Yes
- It may take different values according to the nature of the study
- Other

The sample distribution when responding to the test is represented in Tables IV and V.
The members of the study sample were contacted by e-mail, and a test link was created and made available on each student's electronic page. The link was available for one hour, representing the testing period, and the time was coordinated in advance with the study sample so that it was appropriate for everyone. Cases with any type of technological problem were not recorded. Electronic reminders via the university's electronic system were sent before the test to alert the study sample about the test time.

TABLE IV. DISTRIBUTION OF THE STUDY SAMPLE ACCORDING TO GENDER AND CURRENT MAJOR
Gender | Psychology | Islamic Education | Curriculum and Instruction | Educational Administration | Special Education | Total
Male | 45 | 38 | 37 | 30 | 30 | 180
Female | 29 | 31 | 29 | 32 | 37 | 158
Total | 74 | 69 | 66 | 62 | 67 | 338

TABLE V. DISTRIBUTION OF THE STUDY SAMPLE ACCORDING TO STUDYING STATISTICS IN EARLIER STAGES AND CURRENT MAJOR
Studied statistics in an earlier stage | Psychology | Islamic Education | Curriculum and Instruction | Educational Administration | Special Education | Total
Yes | 56 | 51 | 52 | 42 | 34 | 235
No | 18 | 18 | 14 | 20 | 33 | 103
Total | 74 | 69 | 66 | 62 | 67 | 338

VIII. RESULTS
Figure 1 shows the eigenvalue scree plot. It is clear that the first eigenvalue is much greater than the others, suggesting that a unidimensional model is reasonable for these data.

Fig. 1. Eigenvalue scree plot.

Since each test item has one correct alternative worth a single point, the item difficulty is simply the proportion of students who answered the item correctly, which is equal to the item mean. The item difficulty indices in Tables VI to X show the ranges based on the examinees' characteristics. The overall item difficulty ranged from 0.647 to 0.928. For students who had studied statistics in earlier stages, the item difficulties ranged from 0.878 to 0.970, whereas for students who had not, they ranged from 0.310 to 0.893. The item difficulties based on gender ranged from 0.650 to 0.966 for male students and from 0.645 to 0.924 for female students. Regarding the students' current major, the item difficulty varied over subjects: for Special Education it ranged from 0.522 to 0.985, for Educational Administration from 0.50 to 0.870, for Curriculum and Instruction from 0.651 to 0.999, for Islamic Education from 0.521 to 0.927, and for Psychology from 0.837 to 0.999. For a higher GPA the item difficulties ranged from 0.857 to 0.994, for a moderate GPA from 0.400 to 0.936, and for a low GPA from 0.021 to 0.869.

TABLE VI. OVERALL ITEM DIFFICULTY
Item | Overall
Q1 | 0.837
Q2 | 0.668
Q3 | 0.843
Q4 | 0.751
Q5 | 0.647
Q6 | 0.928
Q7 | 0.798
Q8 | 0.763
Q9 | 0.917
Q10 | 0.857
Q11 | 0.757
Q12 | 0.792
Q13 | 0.784
Q14 | 0.739
Q15 | 0.695
Q16 | 0.825
Q17 | 0.786
Q18 | 0.828

TABLE VII.
ITEM DIFFICULTY ACCORING TO STUDYING STATISTICS IN AN EARLIER STAGE Items Studying statistics in earlier stages Yes No Q1 0.927 0.631 Q2 0.787 0.398 Q3 0.970 0.553 Q4 0.842 0.543 Q5 0.795 0.310 Q6 0.944 0.893 Q7 0.897 0.572 Q8 0.765 0.757 Q9 0.974 0.786 Q10 0.944 0.660 Q11 0.834 0.580 Q12 0.868 0.621 Q13 0.880 0.563 Q14 0.872 0.436 Q15 0.825 0.398 Q16 0.910 0.631 Q17 0.842 0.660 Q18 0.893 0.679 Engineering, Technology & Applied Science Research Vol. 11, No. 2, 2021, 6889-6901 6894 www.etasr.com Almaleki: Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple … Figures 2 - 19 present the combined curve for the 18 items based on the overall data. Each item has one threshold. Most thresholds span the negative section of the ability. Item information curves show that the items provide a good amount of information for students who had a lower or moderate ability compared to high ability students. In terms of precision, most of the items were more convenient to lower ability students (e.g. items 9, 10, 17), while items 2, 5, 14 gathered more information for students who had moderate ability. TABLE VIII. ITEMS DIFFICULTY ACCORING TO GENDER Items Gender Male Female Q1 0.844 0.829 Q2 0.661 0.677 Q3 0.872 0.810 Q4 0.761 0.740 Q5 0.650 0.645 Q6 0.933 0.924 Q7 0.827 0.765 Q8 0.688 0.848 Q9 0.966 0.860 Q10 0.861 0.854 Q11 0.772 0.740 Q12 0.816 0.765 Q13 0.850 0.708 Q14 0.777 0.696 Q15 0.755 0.626 Q16 0.888 0.753 Q17 0.850 0.715 Q18 0.838 0.816 TABLE IX. ITEMS DIFFICULTYACCORING TO THE CURRENT MAJOR I te m s Current master's study major S p e c ia l E d u c a ti o n E d u c a ti o n a l A d m in is tr a ti o n C u r r ic u lu m a n d I n s tr u c ti o n I s la m ic E d u c a ti o n P s y c h o lo g y Q1 0.761 0.822 0.939 0.768 0.891 Q2 0.522 0.645 0.712 0.521 0.918 Q3 0.567 0.838 0.999 0.840 0.959 Q4 0.731 0.693 0.848 0.623 0.851 Q5 0.507 0.500 0.651 0.623 0.918 Q6 0.985 0.725 0.954 0.956 0.999 Q7 0.776 0.741 0.772 0.753 0.932 Q8 0.820 0.822 0.893 0.608 0.689 Q9 0.865 0.838 0.969 0.927 0.972 Q10 0.656 0.870 0.984 0.782 0.986 Q11 0.656 0.661 0.939 0.521 0.986 Q12 0.761 0.709 0.954 0.637 0.891 Q13 0.776 0.741 0.787 0.652 0.945 Q14 0.567 0.596 0.863 0.710 0.932 Q15 0.582 0.516 0.742 0.739 0.864 Q16 0.791 0.629 0.878 0.869 0.932 Q17 0.776 0.677 0.772 0.855 0.837 Q18 0.791 0.725 0.893 0.840 0.873 Figure 20 represents the test characteristic curve that was the functional relation between the true score and the ability scale. As we can see the probability of the correct response was near to 1 at the lowest levels of ability and it increased until it came to the highest levels of ability. The probability of correct response in this test is about 18 for high ability students. Figure 21 presents the total amount of the information that has been obtained from the test. It appears clearly that the test gives good indicators to assess lower levels of agreement. TABLE X. ITEM DIFFICULTY ACCORING TO THE GPA Items GPA ≥ 3.75 3.30-3.75 2.5 – 3.30 Q1 0.950 0.809 0.456 Q2 0.950 0.427 0.130 Q3 0.989 0.872 0.195 Q4 0.928 0.581 0.456 Q5 0.956 0.400 0.021 Q6 0.994 0.845 0.869 Q7 0.939 0.700 0.478 Q8 0.857 0.645 0.673 Q9 0.983 0.936 0.608 Q10 0.939 0.854 0.304 Q11 0.972 0.554 0.282 Q12 0.956 0.627 0.543 Q13 0.956 0.627 0.478 Q14 0.972 0.627 0.086 Q15 0.939 0.527 0.130 Q16 0.950 0.745 0.521 Q17 0.923 0.700 0.456 Q18 0.934 0.781 0.521 Fig. 2. Characteristic curve of item 1. Fig. 3. Characteristic curve of item 2. Engineering, Technology & Applied Science Research Vol. 11, No. 
Fig. 4. Characteristic curve of item 3.
Fig. 5. Characteristic curve of item 4.
Fig. 6. Characteristic curve of item 5.
Fig. 7. Characteristic curve of item 6.
Fig. 8. Characteristic curve of item 7.
Fig. 9. Characteristic curve of item 8.
Fig. 10. Characteristic curve of item 9.
Fig. 11. Characteristic curve of item 10.
Fig. 12. Characteristic curve of item 11.
Fig. 13. Characteristic curve of item 12.
Fig. 14. Characteristic curve of item 13.
Fig. 15. Characteristic curve of item 14.
Fig. 16. Characteristic curve of item 15.
Fig. 17. Characteristic curve of item 16.
Fig. 18. Characteristic curve of item 17.
Fig. 19. Characteristic curve of item 18.
Fig. 20. Test characteristic curve between the true score and the ability scale.
Fig. 21. Test characteristic curve.
Fig. 22. Test information curve according to male examinees.

Figures 22 to 33 represent the test information curves according to the change in the examinees' characteristics. The amount of information obtained for female students appeared to be greater than the information obtained for male students, and the test provided more information about students who had not studied statistics at an earlier stage than about students who had. Figures 27 and 28 also confirm that the test provides a good amount of information for students of lower or moderate ability compared to students of high ability.

Fig. 23. Test information curve according to female examinees.
Fig. 24. Test information curve according to examinees who had studied statistics in an earlier stage.
Fig. 25. Test information curve according to examinees who had not studied statistics in an earlier stage.
Fig. 26. Test information curve according to examinees who had a GPA of 3.75 and above.
Fig. 27. Test information curve according to examinees who had a GPA in the range from 3.30 to less than 3.75.
Fig. 28. Test information curve according to examinees who had a GPA in the range from 2.50 to less than 3.30.
Fig. 29. Test information curve according to current major (Special Education).
Fig. 30. Test information curve according to current major (Educational Administration).
Fig. 31. Test information curve according to current major (Curriculum and Instruction).
Fig. 32. Test information curve according to current major (Islamic Education).
Fig. 33. Test information curve according to current major (Psychology).

IX.
DISCUSSION The study findings support previous researches [5-7, 63] for the use of IRT models in analyzing students' responses to an achievement test in a more objective way, showing whether there is an effect of the multiplicity of characteristics of the participants on the test items (multiple choice) in terms of the accuracy of the estimates of the items’ parameters and the individuals’ ability parameters. The results also help determine the relationship between individual responses and their basic ability. Within the framework of modern measurement theory, many models and applications have been formulated and applied to real test data. This includes the measurement assumptions about the characteristics of the test items, the performance of the subjects, and how performance is related to knowledge. This test clearly indicated that based on the level of the statistics course, there should be a periodic review of the tests in line with the nature and level of the course materials in order to have a logical judgment about the level of students’ progress and the level of their ability. This conclusion is consistent with the findings of [50, 64]. X. CONCLUSION In general, this study seeks to further improve the evaluation practices in the educational process. The tests that describe the students' progress in the educational process should be subject to review by evaluation and measurement specialists in order to ensure that we have valid and reliable evaluation tools. The administrators of the educational system need to find a mechanism to review question banks and align them with the requirements of the scientific material of the courses that are subject to continuous development. REFERENCES [1] B. Zhuang, S. Wang, S. Zhao, and M. Lu, "Computed tomography angiography-derived fractional flow reserve (CT-FFR) for the detection of myocardial ischemia with invasive fractional flow reserve as reference: systematic review and meta-analysis," European Radiology, vol. 30, no. 2, pp. 712–725, Feb. 2020, https://doi.org/10.1007/s00330- 019-06470-8. [2] Y. A. Wang and M. Rhemtulla, "Power Analysis for Parameter Estimation in Structural Equation Modeling:A Discussion and Tutorial," in Advances in Methods and Practices in Psychological Science, California, USA: University of California, 2020. [3] H. Zhu, W. Gao, and X. Zhang, "Bayesian Analysis of a Quantile Multilevel Item Response Theory Model," Frontiers in Psychology, vol. 11, Jan. 2021, Art. no. 607731, https://doi.org/10.3389/fpsyg. 2020.607731. [4] M. R. Szeles, "Examining the foreign policy attitudes in Moldova," PLOS ONE, vol. 16, no. 1, 2021, Art. no. e0245322, https://doi.org/ 10.1371/journal.pone.0245322. [5] D. Almaleki, "The Precision of the Overall Data-Model Fit for Different Design Features in Confirmatory Factor Analysis," Engineering, Technology & Applied Science Research, vol. 11, no. 1, pp. 6766–6774, Feb. 2021, https://doi.org/10.48084/etasr.4025. [6] D. Almaleki, "Empirical Evaluation of Different Features of Design in Confirmatory Factor Analysis," Ph.D. dissertation, Western Michigan University, MC, USA, 2016. [7] C. S. Wardley, E. B. Applegate, A. D. Almaleki, and J. A. Van Rhee, "A Comparison of Students’ Perceptions of Stress in Parallel Problem- Based and Lecture-Based Curricula," The Journal of Physician Assistant Education, vol. 27, no. 1, pp. 7–16, Mar. 2016, https://doi.org/10.1097/ JPA.0000000000000060. [8] C. Wardley, E. Applegate, A. Almaleki, and J. V. 
Rhee, "Is Student Stress Related to Personality or Learning Environment in a Physician Assistant Program?," The Journal of Physician Assistant Education, vol. 30, no. 1, pp. 9–19, Mar. 2019, https://doi.org/10.1097/JPA. 0000000000000241. [9] A. C. Villa Montoya et al., "Optimization of key factors affecting hydrogen production from coffee waste using factorial design and metagenomic analysis of the microbial community," International Journal of Hydrogen Energy, vol. 45, no. 7, pp. 4205–4222, Feb. 2020, https://doi.org/10.1016/j.ijhydene.2019.12.062. [10] N. M. Moo-Tun, G. Iniguez-Covarrubias, and A. Valadez-Gonzalez, "Assessing the effect of PLA, cellulose microfibers and CaCO3 on the properties of starch-based foams using a factorial design," Polymer Testing, vol. 86, Jun. 2020, Art. no. 106482, https://doi.org/ 10.1016/j.polymertesting.2020.106482. [11] K. M. Marcoulides, N. Foldnes, and S. Grønneberg, "Assessing Model Fit in Structural Equation Modeling Using Appropriate Test Statistics," Structural Equation Modeling: A Multidisciplinary Journal, vol. 27, no. 3, pp. 369–379, May 2020, https://doi.org/10.1080/10705511. 2019.1647785. [12] M. D. H. Naveiras, "Using Auxiliary Item Information in the Item Parameter Estimation of a Graded Response Model for a Small to Medium Sample Size: Empirical versus Hierarchical Bayes Estimation," Ph.D. dissertation, Vanderbilt University, Nashville, TN, USA, 2020. Engineering, Technology & Applied Science Research Vol. 11, No. 2, 2021, 6889-6901 6900 www.etasr.com Almaleki: Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple … [13] M. N. Morshed, M. N. Pervez, N. Behary, N. Bouazizi, J. Guan, and V. A. Nierstrasz, "Statistical modeling and optimization of heterogeneous Fenton-like removal of organic pollutant using fibrous catalysts: a full factorial design," Scientific Reports, vol. 10, no. 1, Sep. 2020, Art. no. 16133, https://doi.org/10.1038/s41598-020-72401-z. [14] W. van Lankveld, R. J. Pat-El, N. van Melick, R. van Cingel, and J. B. Staal, "Is Fear of Harm (FoH) in Sports-Related Activities a Latent Trait? The Item Response Model Applied to the Photographic Series of Sports Activities for Anterior Cruciate Ligament Rupture (PHOSA- ACLR)," International Journal of Environmental Research and Public Health, vol. 17, no. 18, Sep. 2020, Art. no. 6764, https://doi.org/ 10.3390/ijerph17186764. [15] C. Shin, S.-H. Lee, K.-M. Han, H.-K. Yoon, and C. Han, "Comparison of the Usefulness of the PHQ-8 and PHQ-9 for Screening for Major Depressive Disorder: Analysis of Psychiatric Outpatient Data," Psychiatry Investigation, vol. 16, no. 4, pp. 300–305, Apr. 2019, https://doi.org/10.30773/pi.2019.02.01. [16] C. W. Ong, B. G. Pierce, D. W. Woods, M. P. Twohig, and M. E. Levin, "The Acceptance and Action Questionnaire – II: an Item Response Theory Analysis," Journal of Psychopathology and Behavioral Assessment, vol. 41, no. 1, pp. 123–134, Mar. 2019, https://doi.org/ 10.1007/s10862-018-9694-2. [17] A. Acevedo-Mesa, J. N. Tendeiro, A. Roest, J. G. M. Rosmalen, and R. Monden, "Improving the Measurement of Functional Somatic Symptoms With Item Response Theory," Assessment, Aug. 2020, Art. no. 1073191120947153, https://doi.org/10.1177/1073191120947153. [18] J. Xia, Z. Tang, P. Wu, J. Wang, and J. Yu, "Use of item response theory to develop a shortened version of the EORTC QLQ-BR23 scales," Scientific Reports, vol. 9, no. 1, Feb. 2019, Art. no. 1764, https://doi.org/10.1038/s41598-018-37965-x. [19] Y. Liu and J. S. 
Yang, "Interval Estimation of Latent Variable Scores in Item Response Theory," Journal of Educational and Behavioral Statistics, vol. 43, no. 3, pp. 259–285, Jun. 2018, https://doi.org/10.3102/ 1076998617732764. [20] U. Gromping, "Coding invariance in factorial linear models and a new tool for assessing combinatorial equivalence of factorial designs," Journal of Statistical Planning and Inference, vol. 193, pp. 1–14, Feb. 2018, https://doi.org/10.1016/j.jspi.2017.07.004. [21] P. J. Ferrando and U. Lorenzo-Seva, "Assessing the Quality and Appropriateness of Factor Solutions and Factor Score Estimates in Exploratory Item Factor Analysis," Educational and Psychological Measurement, vol. 78, no. 5, pp. 762–780, Oct. 2018, https://doi.org/10.1177/0013164417719308. [22] X. An and Y.-F. Yung, "Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It," SAS Institute Inc., Paper SAS364-2014. [23] K. Coughlin, "An Analysis of Factor Extraction Strategies: A Comparison of the Relative Strengths of Principal Axis, Ordinary Least Squares, and Maximum Likelihood in Research Contexts that Include both Categorical and Continuous Variables," Ph.D. dissertation, University of South Florida, Tampa, FL, USA, 2013. [24] D. L. Bandalos and P. Gagne, "Simulation methods in structural equation modeling," in Handbook of structural equation modeling, New York, ΝΥ, USA: The Guilford Press, 2012, pp. 92–108. [25] J. C. F. de Winter, D. Dodou, and P. A. Wieringa, "Exploratory Factor Analysis With Small Sample Sizes," Multivariate Behavioral Research, vol. 44, no. 2, pp. 147–181, Apr. 2009, https://doi.org/10.1080/ 00273170902794206. [26] J. D. Kechagias, K.-E. Aslani, N. A. Fountas, N. M. Vaxevanidis, and D. E. Manolakos, "A comparative investigation of Taguchi and full factorial design for machinability prediction in turning of a titanium alloy," Measurement, vol. 151, Feb. 2020, Art. no. 107213, https://doi.org/10.1016/j.measurement.2019.107213. [27] G. Kuan, A. Sabo, S. Sawang, and Y. C. Kueh, "Factorial validity, measurement and structure invariance of the Malay language decisional balance scale in exercise across gender," PLOS ONE, vol. 15, no. 3, 2020, Art. no. e0230644, https://doi.org/10.1371/journal.pone.0230644. [28] M. J. Allen and W. M. Yen, Introduction to measurement theory. Monterey, CA, USA: Cole Publishing, 1979. [29] O. P. John and S. Srivastava, "The Big Five trait taxonomy: History, measurement, and theoretical perspectives," in Handbook of personality: Theory and research, New York, NY, USA: Guilford Press, 1999, pp. 102–138. [30] S.-H. Joo, L. Khorramdel, K. Yamamoto, H. J. Shin, and F. Robin, "Evaluating Item Fit Statistic Thresholds in PISA: Analysis of Cross- Country Comparability of Cognitive Items," Educational Measurement: Issues and Practice, Nov. 2020, https://doi.org/10.1111/emip.12404. [31] H. Bourdeaud’hui, "Investigating the effects of presenting listening test items in a singular versus dual mode on students’ critical listening performance," in Upper-primary school students’ listening skills: Assessment and the relationship with student and class-level characteristics, Ghent, Belgium: Ghent University, 2019. [32] D. M. Dimitrov and Y. Luo, "A Note on the D-Scoring Method Adapted for Polytomous Test Items," Educational and Psychological Measurement, vol. 79, no. 3, pp. 545–557, Jun. 2019, https://doi.org/ 10.1177/0013164418786014. [33] J. Suarez-Alvarez, I. Pedrosa, L. Lozano, E. Garcia-Cueto, M. Cuesta, and J. 
Muniz, "Using reversed items in Likert scales: A questionable practice," Psicothema, vol. 30, no. 2, pp. 149–158, 2018, https://doi.org/ 10.7334/psicothema2018.33. [34] J. P. Lalor, H. Wu, and H. Yu, "Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds," in Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, Nov. 2019, pp. 4240–4250, https://doi.org/10.18653/v1/D19-1434. [35] B. Couvy-Duchesne, T. A. Davenport, N. G. Martin, M. J. Wright, and I. B. Hickie, "Validation and psychometric properties of the Somatic and Psychological HEalth REport (SPHERE) in a young Australian-based population sample using non-parametric item response theory," BMC Psychiatry, vol. 17, no. 1, Aug. 2017, Art. no. 279, https://doi.org/ 10.1186/s12888-017-1420-1. [36] P. M. Bentler and D. G. Bonett, "Significance tests and goodness of fit in the analysis of covariance structures," Psychological Bulletin, vol. 88, no. 3, pp. 588–606, 1980, https://doi.org/10.1037/0033-2909.88.3.588. [37] A. Schimmenti, L. Sideli, L. L. Marca, A. Gori, and G. Terrone, "Reliability, Validity, and Factor Structure of the Maladaptive Daydreaming Scale (MDS–16) in an Italian Sample," Journal of Personality Assessment, vol. 102, no. 5, pp. 689–701, Sep. 2020, https://doi.org/10.1080/00223891.2019.1594240. [38] C.-Y. Lin, V. Imani, M. D. Griffiths, and A. H. Pakpour, "Validity of the Yale Food Addiction Scale for Children (YFAS-C): Classical test theory and item response theory of the Persian YFAS-C," Eating and Weight Disorders - Studies on Anorexia, Bulimia and Obesity, Jul. 2020, https://doi.org/10.1007/s40519-020-00956-x. [39] L. Jiang et al., "The Reliability and Validity of the Center for Epidemiologic Studies Depression Scale (CES-D) for Chinese University Students," Frontiers in Psychiatry, vol. 10, 2019, Art. no. 315, https://doi.org/10.3389/fpsyt.2019.00315. [40] S. Doi, M. Ito, Y. Takebayashi, K. Muramatsu, and M. Horikoshi, "Factorial validity and invariance of the Patient Health Questionnaire (PHQ)-9 among clinical and non-clinical populations," PLOS ONE, vol. 13, no. 7, 2018, Art. no. e0199235. [41] T. Tsubakita, K. Shimazaki, H. Ito, and N. Kawazoe, "Item response theory analysis of the Utrecht Work Engagement Scale for Students (UWES-S) using a sample of Japanese university and college students majoring medical science, nursing, and natural science," BMC Research Notes, vol. 10, no. 1, Oct. 2017, Art. no. 528, https://doi.org/10.1186/ s13104-017-2839-7. [42] S. C. Smid, D. McNeish, M. Miocevic, and R. van de Schoot, "Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts: A Systematic Review," Structural Equation Modeling: A Multidisciplinary Journal, vol. 27, no. 1, pp. 131–161, Jan. 2020, https://doi.org/10.1080/10705511.2019.1577140. [43] M. K. Cain and Z. Zhang, "Fit for a Bayesian: An Evaluation of PPP and DIC for Structural Equation Modeling," Structural Equation Modeling: A Multidisciplinary Journal, vol. 26, no. 1, pp. 39–50, Jan. 2019, https://doi.org/10.1080/10705511.2018.1490648. Engineering, Technology & Applied Science Research Vol. 11, No. 2, 2021, 6889-6901 6901 www.etasr.com Almaleki: Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple … [44] D. Garson, "StatNotes: Topics in Multivariate Analysis," North Carolina State University. 
https://faculty.chass.ncsu.edu/garson/PA765/statnote. htm (accessed Feb. 10, 2021). [45] H. W. Marsh, K.-T. Hau, and D. Grayson, "Goodness of Fit in Structural Equation Models," in Contemporary psychometrics: A festschrift for Roderick P. McDonald, Mahwah, NJ, USA: Lawrence Erlbaum Associates Publishers, 2005, pp. 275–340. [46] I. Williams, "A speededness item response model for associating ability and speededness parameters," Ph.D. dissertation, Rutgers University, New Brunswick, NJ, USA, 2017. [47] B. Shamshad and J. S. Siddiqui, "Testing Procedure for Item Response Probabilities of 2Class Latent Model," Mehran University Research Journal of Engineering and Technology, vol. 39, no. 3, pp. 657–667, Jul. 2020, https://doi.org/10.22581/muet1982.2003.20. [48] K. M. Williams and B. D. Zumbo, "Item Characteristic Curve Estimation of Signal Detection Theory-Based Personality Data: A Two- Stage Approach to Item Response Modeling," International Journal of Testing, vol. 3, no. 2, pp. 189–213, Jun. 2003, https://doi.org/10.1207/ S15327574IJT0302_7. [49] D. Tafiadis et al., "Using Receiver Operating Characteristic Curve to Define the Cutoff Points of Voice Handicap Index Applied to Young Adult Male Smokers," Journal of Voice, vol. 32, no. 4, pp. 443–448, Jul. 2018, https://doi.org/10.1016/j.jvoice.2017.06.007. [50] L. Lina, D. Mardapi, and H. Haryanto, "Item Characteristics on Pro- TEFL Listening Section," presented at the First International Conference on Advances in Education, Humanities, and Language, ICEL 2019, Malang, Indonesia, 23-24 March 2019, Jul. 2019, https://dx.doi.org/ 10.4108/eai.11-7-2019.159630. [51] D. L. Moody, "The method evaluation model: a theoretical model for validating information systems design methods," in European Conference on Information Systems, Naples, Italy, Jun. 2003, pp. 1–17. [52] H. Davis, T. M. Rosner, M. C. D’Angelo, E. MacLellan, and B. Milliken, "Selective attention effects on recognition: the roles of list context and perceptual difficulty," Psychological Research, vol. 84, no. 5, pp. 1249–1268, Jul. 2020, https://doi.org/10.1007/s00426-019-01153- x. [53] L. Sun, Y. Liu, and F. Luo, "Automatic Generation of Number Series Reasoning Items of High Difficulty," Frontiers in Psychology, vol. 10, 2019, Art. no. 884, https://doi.org/10.3389/fpsyg.2019.00884. [54] T. O. Abe and E. O. Omole, "Difficulty and Discriminating Indices of Junior Secondary School Mathematics Examination; A Case Study of Oriade Local Government, Osun State," American Journal of Education and Information Technology, vol. 3, no. 2, pp. 37–46, Oct. 2019, https://doi.org/10.11648/j.ajeit.20190302.12. [55] G. Nelson and S. R. Powell, "Computation Error Analysis: Students With Mathematics Difficulty Compared To Typically Achieving Students," Assessment for Effective Intervention, vol. 43, no. 3, pp. 144– 156, Jun. 2018, https://doi.org/10.1177/1534508417745627. [56] H. Retnawati, B. Kartowagiran, J. Arlinwibowo, and E. Sulistyaningsih, "Why Are the Mathematics National Examination Items Difficult and What Is Teachers’ Strategy to Overcome It?," International Journal of Instruction, vol. 10, no. 3, pp. 257–276, Jul. 2017. [57] T. A. Holster, J. W. Lake, and W. R. Pellowe, "Measuring and predicting graded reader difficulty," vol. 29, no. 2, pp. 218–244, Oct. 2017. [58] S. Gaitas and M. A. Martins, "Teacher perceived difficulty in implementing differentiated instructional strategies in primary school," International Journal of Inclusive Education, vol. 21, no. 5, pp. 
544– 556, May 2017, https://doi.org/10.1080/13603116.2016.1223180. [59] J. L. D’Sa and M. L. Visbal-Dionaldo, "Analysis of Multiple Choice Questions: Item Difficulty, Discrimination Index and Distractor Efficiency," International Journal of Nursing Education, vol. 9, no. 3, pp. 109–114, 2017. [60] A. H. Blasi and M. Alsuwaiket, "Analysis of Students’ Misconducts in Higher Education using Decision Tree and ANN Algorithms," Engineering, Technology & Applied Science Research, vol. 10, no. 6, pp. 6510–6514, Dec. 2020, https://doi.org/10.48084/etasr.3927. [61] N. Sharifi, M. Falsafi, N. Farokhi, and E. Jamali, "Assessing the optimal method of detecting Differential Item Functioning in Computerized Adaptive Testing," Quarterly of Educational Measurement, vol. 9, no. 33, pp. 23–51, Oct. 2018, https://doi.org/10.22054/jem.2019.11109. 1323. [62] J. J. Hox, C. J. M. Maas, and M. J. S. Brinkhuis, "The effect of estimation method and sample size in multilevel structural equation modeling," Statistica Neerlandica, vol. 64, no. 2, pp. 157–170, 2010, https://doi.org/10.1111/j.1467-9574.2009.00445.x. [63] G. Makransky, L. Lilleholt, and A. Aaby, "Development and validation of the Multimodal Presence Scale for virtual reality environments: A confirmatory factor analysis and item response theory approach," Computers in Human Behavior, vol. 72, pp. 276–285, Jul. 2017, https://doi.org/10.1016/j.chb.2017.02.066. [64] J. A. Costa, J. Maroco, and J. Pinto‐Gouveia, "Validation of the psychometric properties of cognitive fusion questionnaire. A study of the factorial validity and factorial invariance of the measure among osteoarticular disease, diabetes mellitus, obesity, depressive disorder, and general populations," Clinical Psychology & Psychotherapy, vol. 24, no. 5, pp. 1121–1129, 2017, https://doi.org/10.1002/cpp.2077.