Copyright © 2018, REiD (Research and Evaluation in Education) ISSN 2460-6995
REiD (Research and Evaluation in Education), 4(2), 2018, 144-154
Available online at: http://journal.uny.ac.id/index.php/reid

Mapping of physics problem-solving skills of senior high school students using PhysProSS-CAT

*1Edi Istiyono; 2Wipsar Sunu Brams Dwandaru; 3Revnika Faizah
1Department of Educational Research and Evaluation, Universitas Negeri Yogyakarta
Jl. Colombo No. 1, Depok, Sleman, Yogyakarta 55281, Indonesia
2,3Department of Physics Education, Universitas Negeri Yogyakarta
Jl. Colombo No. 1, Depok, Sleman, Yogyakarta 55281, Indonesia
*Corresponding Author. E-mail: edi_istiyono@uny.ac.id

Submitted: 04 December 2018 | Revised: 19 December 2018 | Accepted: 20 December 2018

Abstract

Evaluation using computerized adaptive tests (CAT) is an alternative to paper-based tests (PBT). This study aimed to map physics problem-solving skills using PhysProSS-CAT on the basis of item response theory (IRT). The study was conducted in Sleman Regency, Yogyakarta, and involved 156 Grade XI senior high school students selected by stratified random sampling. The results show that PhysProSS-CAT is able to measure physics problem-solving skills accurately. Students’ competences in physics problem solving are distributed as follows: 6% in the very high category, 4% in the high category, 36% in the medium category, 36% in the low category, and 18% in the very low category. The majority of the students’ competences in physics problem solving therefore lie within the medium and low categories.

Keywords: assessment, problem-solving skill, CAT

Introduction

One of the 21st-century learning and innovation skills is the ability related to critical thinking, problem solving, technology, and information (Daryanto & Karim, 2017). Technology is an integral aspect of the development of a nation.
The more advanced the culture of a nation, the more varied and complicated the technology it uses. Problem solving is a cognitive process directed to the attainment of an objective when there is a solution method to solve a problem (Bueno, 2014). Physics learning relies heavily on problem-solving skills; evaluation is therefore necessary as one of the efforts to raise the learners’ thinking skills. Nitko and Brookhart (2011, p. 3) define evaluation as a process of obtaining information for making decisions concerning learners, curriculum, program, school, and educational policy.

Evaluation instruments used in learning cover tests and non-tests (Nitko & Brookhart, 2011). Test-type instruments can be further grouped into objective tests and non-objective tests. Objective tests can be in the form of multiple-choice, short-answer, matching, and objective essay items. Non-objective tests can be open essays, work performance or observation, and portfolios or project tasks (Mundilarto, 2010, p. 52). Multiple-choice test items can be used to assess more complex learning outcomes concerned with recall, understanding, application, analysis, synthesis, and evaluation (Arifin, 2016, p. 138). A test can be administered in two modes: paper-and-pencil and computer-based test (CBT). The paper-and-pencil test is the paper-based test (PBT) that has long been in use, while CBT is delivered by computer (Pakpahan, 2016, p. 24).

PBT is based on the assumption that learners of the same age and educational level have the same level of competence. In reality, however, there is significant variation (Bagus, 2012, pp. 45–46).
The PBT model has many shortcomings, especially those related to deviant behaviors such as fraud, discussion among testees, sharing of answer keys, or even teachers or schools giving out answer keys so that society does not regard them as failing in running education and learning (Balan, Sudarmin, & Kustiono, 2017, p. 37). Further, Retnawati (2014, p. 190) states that Indonesia is a large archipelago consisting of tens of provinces. As such, the distribution of test packages from the centre to the regions faces many obstacles, for example during the national examination (NE). This causes, among other things, test administration to be unfair and test results to be invalid in that they do not represent the real competences of the students. These limitations of PBT can be overcome by testing using the computer.

Computer-based testing has some advantages. Testees do not need to wait for weeks to receive their scores, since scores can be obtained immediately. CBT also provides the facility for giving each testee test items that are pre-arranged to give the testee the freedom to select the next test item (Miller, Linn, & Gronlund, 2009, p. 12). According to Luecht and Sireci (2011), the CBT model can be categorized into: (1) computerized fixed tests (CFT); (2) linear-on-the-fly tests (LOFT); (3) computerized adaptive tests (CAT); (4) a-stratified computerized adaptive tests (AS); (5) content-constrained CAT with shadow tests; (6) testlet-based CAT and multistage computerized mastery tests (combined); and (7) computer-adaptive multistage tests.

Each model has its own advantages and disadvantages. CBT offers more advantages than PBT in that, among other things, its scoring is automatic and it reduces the burden on the testees (Riley & Carle, 2012).
However, CBT is similar to PBT in that it may not measure the testees’ abilities accurately, since there is still potential for fraud in its administration. CBT also requires the testees to respond to all of the items, which makes inefficient use of testing time.

Two theories of assessment have been developed empirically and technologically: classical test theory (CTT) and item response theory (IRT). CTT and IRT broadly represent two different frameworks of assessment. In CTT, a test is scored partially, following the steps that need to be taken to answer a test item correctly. Scoring is conducted step by step, each testee’s item score is obtained by summing the scores of the steps, and achievement is estimated from raw scores. This scoring model may not be appropriate since the difficulty level of each step is not taken into consideration (Istiyono, Mardapi, & Suparno, 2014, p. 4). At the item level, the CTT model is relatively simple; CTT does not demand a complex theoretical model to relate a testee’s ability to success in responding to a test item. Instead, CTT collectively considers a group of testees for a particular item.

IRT has been developed as an important complement to CTT in the design, interpretation, and evaluation of a test or examination. IRT has a strong mathematical basis and relies on complex algorithms that are calculated more efficiently on a computer (Adedoyin, 2010, p. 108). IRT supports the use of the computer in educational testing. IRT can be used to provide any item saved in the computer independently, so that the computer can select a test from item banks, manage the procedure of item administration, or design a model for a new computer-based item-response test (Masters & Keeves, 1999, p. 139; van der Linden & Glas, 2003). Thus, a test which uses CAT is highly compatible with item response theory (IRT).

Hambleton, Swaminathan, and Rogers (1991, p.
9) propose three assumptions underlying item response theory: (1) the chance of answering an item correctly does not depend on that for another item (local independence), (2) an item measures one competence dimension (unidimensionality), and (3) the response pattern of each item can be represented by an item characteristic curve. These three assumptions address the weaknesses of the classical theory. Hambleton et al. (1991) identify four limitations of the classical theory. First, item statistics such as difficulty levels and discriminating powers are restricted to the specific samples observed; i.e., they depend on the group and the test. Second, reliability is defined by the parallel-test concept, which is difficult to realize in practice, because individuals are never the same at a second testing: they may forget, gain new competences, or have different motivation and anxiety levels. Third, the standard error of measurement is assumed to be the same for all testees, and variability in errors is not considered. Fourth, the classical theory focuses on test-level information and puts item-level information aside. Test-level information is additive, that is, the sum of information across items, whereas item-level information is the information for a particular item only. These limitations show that the classical theory deals with total scores rather than with each testee’s competence at the individual level.

A CAT is based on item response theory. Hambleton and Swaminathan (1985, p. 48) state that there are three types of scoring systems: dichotomous, polytomous, and continuous. Of the three, the dichotomous system is the most used in educational evaluation.
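As an illustration of the item characteristic curve in assumption (3) for dichotomously scored items, the curve is commonly written in the three-parameter logistic form. The notation below is conventional IRT notation and is an assumption here, since the article does not spell out the formula: $\theta$ is the testee's competence, and $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing parameters of item $i$.

```latex
P_i(\theta) = c_i + (1 - c_i)\,
  \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
```

Setting $c_i = 0$ gives the two-parameter logistic model, and additionally setting $a_i = 1$ gives the one-parameter logistic (Rasch) model.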
The models that can be used for dichotomous data are the latent linearity, perfect scale, and latent distance models, the one-, two-, and three-parameter normal ogive models, the one-, two-, and three-parameter logistic models, and the four-parameter logistic model (Barton & Lord, 1981; Guttman, 1944; Lazarsfeld & Henry, 1968; Lord, 1952). The dichotomous model is only suitable for items with two score categories, such as true/false. For items with more than two score categories, the polytomous system is used. The polytomous scoring system has a number of models, such as the nominal response, graded response, and partial credit models, among others (Bock, 1972; Masters, 1982; Samejima, 1969).

The partial credit model (PCM) was developed to analyze test items that require multiple-step responses, wherein the items follow the partial credit pattern so that individuals with higher competences score higher than those with lower competences (Istiyono, 2017, p. 2). Therefore, it is reasonable to use the partial credit model for multiple-choice tests scored in more than two categories.

A CAT is based on the principle that items must be selected in consideration of how well they measure the testee’s competence. Generally, an item is selected because it gives the most information for estimating the testee’s competence. Then, based on the true/false response pattern, the competence level is re-estimated, and the next item is selected on the basis of the newly estimated competence. These processes continue until the testee’s competence has been estimated with a certain precision (Hambleton & Zaal, 1991).

Based on this discussion, there is a need to develop a test that measures the testees’ competences in problem solving. The computerized adaptive test (CAT) has been developed as a CBT alternative to PBT that provides better test items and shorter tests suited to each testee. CAT is a testing system which is more advanced than CBT (Hadi, 2013, p. 12).
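The adaptive cycle described above (select the most informative item, re-estimate competence from the response pattern, stop at a target precision) can be sketched as follows. This is a minimal illustration and not the PhysProSS-CAT implementation: it assumes dichotomous items under the Rasch (1-PL) model rather than the PCM, an expected a posteriori (EAP) estimate on a fixed grid with a standard normal prior, and hypothetical names and defaults (`run_cat`, `se_target`, `max_items`).

```python
import math


def p_correct(theta, b):
    """Rasch (1-PL) probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at competence theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


# Quadrature grid for competence, from -4.0 to 4.0 in steps of 0.1 (an assumption).
GRID = [-4.0 + 0.1 * k for k in range(81)]


def eap_estimate(responses):
    """Return (theta_hat, posterior_sd) for a list of (difficulty, score) pairs.

    Uses a standard normal prior; score is 1 for correct and 0 for incorrect.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior density
        for b, u in responses:
            p = p_correct(t, b)
            w *= p if u == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.4, max_items=20):
    """Administer items adaptively from bank (item_id -> difficulty).

    answer(item_id) must return 1 (correct) or 0 (incorrect).
    Stops when the posterior SD reaches se_target, max_items is hit,
    or the bank is exhausted.
    """
    theta, se = 0.0, float("inf")
    remaining = dict(bank)
    responses, administered = [], []
    while remaining and len(administered) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        item_id = max(remaining, key=lambda i: item_information(theta, remaining[i]))
        b = remaining.pop(item_id)
        responses.append((b, answer(item_id)))
        administered.append(item_id)
        # Re-estimate competence from the full response pattern so far.
        theta, se = eap_estimate(responses)
    return theta, se, administered
```

EAP is chosen for the sketch only because it stays finite when a testee has answered every item so far correctly or incorrectly; a maximum likelihood estimate would need special handling in those cases. Extending the sketch to the PCM would require replacing `p_correct` and `item_information` with their polytomous counterparts.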
In line with Suyoso, Istiyono, and Subroto (2017), computer-based evaluation is increasingly needed and can help teachers in conducting evaluation in their subject-matter teaching. In the 21st century, more emphasis is placed on the higher-order thinking cognitive domain, such as HOTS in the Bloom and Marzano frameworks, critical thinking, creative thinking, and problem solving (Brookhart, 2010; Heong et al., 2011; Schraw & Robinson, 2011).

In CBT, testees interact directly with the computer containing the test items of the subject matter. They answer test items through the computer as they do in writing in PBT. The number of items is the same as in PBT, and item characteristics do not function as they do in CAT (Pakpahan, 2016, pp. 26–27). The use of CAT does not require a great number of items, since the computer is able to give items in accordance with the testees’ competence levels. On the contrary, PBT, which is developed on the basis of classical theories, needs a great number of items since it needs to measure the testees’ optimum competences repeatedly (Gregory, 2014). According to Weiss (2004, p. 82), CAT is a viable technology with the potential to give better assessment, in less testing time, for various applications in counseling and education. In these two fields, there is a need to measure individual change. Evaluation applications come in many varieties, and those that apply CAT technologies can take advantage of assessment that is both good and efficient.

Method

The study was conducted in State Senior High School in Sleman Regency, Yogyakarta Province, during the even semester of the 2017-2018 academic year.
The subjects of the study were 156 students of the Physics Department selected by a stratified random sampling technique that took the higher, medium, and lower groups into consideration based on the students’ National Examination scores in Physics. The sample size was determined from the population using the 1-PL formula, which resulted in a range of 150 to 250 students (Linacre, 2006).

Data were collected with a test used to map students’ competences in problem solving in physics. The research participants were asked to take the PhysProSS-CAT test, which was the product of this research and development. The PhysProSS-CAT consists of items developed in the form of multiple-choice items with reasons. The material covers the equilibrium of rigid bodies, elasticity and Hooke’s law, static fluids, dynamic fluids, and temperature and heat.

The development of the instrument was based on the revised Curriculum 2013 and on the aspects and sub-aspects of problem-solving skills (Ministry of Education and Culture, 2013). The aspects included identification, planning, implementation, and evaluation. The sub-aspects included identifying, differentiating, planning, formulating, sequencing, connecting, applying, checking, and criticizing. The test was developed into four sets of test items, 180 in total with nine anchor items.

The test items had the characteristics that fulfilled the requirements for testing.
These requirements were as follows: (a) based on the results of content validation by evaluation experts, the test was content-valid with an Aiken’s V value of 0.97; (b) based on the empirical evidence, the test fit the Partial Credit Model (PCM) for polytomous data with four categories, with an INFIT MNSQ mean and standard deviation of 1.00±0.25; (c) based on the Cronbach’s alpha reliability estimate, all items were regarded as reliable, with a coefficient of 0.93; (d) based on the levels of difficulty, the test was regarded as good, with a range of -1.23 to 1.50; and (e) based on the information function and SEM, the test was able to estimate competences in the range between -2 and 1.6.

The test was scored using the partial credit model (PCM), which is a development of the 1-PL model and belongs to the Rasch family. Meanwhile, the results of the physics problem-solving test administered through the computerized adaptive test (CAT) were categorized into levels adapted from Azwar (2010). The categories are shown in Table 1.

Table 1. Intervals of students’ problem-solving skills

No   Skill Interval                     Level
1    Mi + 1.5SBi < θ                    Very High
2    Mi + 0.5SBi < θ ≤ Mi + 1.5SBi      High
3    Mi – 0.5SBi < θ ≤ Mi + 0.5SBi      Medium
4    Mi – 1.5SBi < θ ≤ Mi – 0.5SBi      Low
5    θ